Abstract

XML is of great importance in information storage and retrieval because of its recent emergence
as a standard for data representation and interchange on the Internet. However, XML provides little
semantic content, and as a result several papers have addressed the topic of how to improve the semantic
expressiveness of XML. Among the most important of these approaches has been that of defining
integrity constraints in XML. In a companion paper we defined strong functional dependencies in
XML (XFDs). We also presented a set of axioms for reasoning about the implication of XFDs and
showed that the axiom system is sound for arbitrary XFDs. In this paper we prove that the axioms 
are also complete for unary XFDs (XFDs with a single path on the l.h.s.). The second contribution 
of the paper is to prove that the implication problem for unary XFDs is decidable and to provide a 
linear time algorithm for it. 

1 Introduction 

The extensible Markup Language (XML) [?] has recently emerged as a standard for data representation
and interchange on the Internet [23, ?]. While providing syntactic flexibility, XML provides little
semantic content, and as a result several papers have addressed the topic of how to improve the semantic
expressiveness of XML. Among the most important of these approaches has been that of defining integrity
constraints in XML [?]. Several different classes of integrity constraints for XML have been defined,
including key constraints [?], path constraints [?] and inclusion constraints [?], and
properties such as axiomatization and satisfiability have been investigated for these constraints. One
observation to make on this research is that the flexible structure of XML makes the investigation of
integrity constraints in XML more complex and subtle than in relational databases. However, one topic
that has been identified as an open problem in XML research and which has been little investigated
is how to extend the oldest and most well studied integrity constraint in relational databases, namely
functional dependencies (FDs), to XML and then how to develop a normalization theory for XML. This
problem is not of just theoretical interest. The theory of FDs and normalization forms the cornerstone
of practical relational database design, and the development of a similar theory for XML will similarly
lay the foundation for understanding how to design XML documents. In addition, the study of FDs in
XML is important because of the close connection between XML and relational databases. With current
technology, the source of XML data is typically a relational database [?] and relational databases are
also normally used to store XML data [20]. Hence, given that FDs are the most important constraint in
relational databases, the study of FDs in XML assumes heightened importance over other types of
constraints which are unique to XML [?]. The only papers that have specifically addressed this problem are
the recent papers [3, 22]. Before presenting the contributions of [3, 22], we briefly outline the approaches
to defining FD satisfaction in incomplete relational databases.

There are two approaches, the first called the weak satisfaction approach and the other called the
strong satisfaction approach [5]. In the weak satisfaction approach, a relation is defined to weakly satisfy
a FD if there exists at least one completion of the relation, obtained by replacing all occurrences of nulls
by data values, which satisfies the FD. A relation is said to strongly satisfy a FD if every completion of
the relation satisfies the FD. Both approaches have their advantages and disadvantages (a more complete
discussion of this issue can be found in [22]). The weak satisfaction approach has the advantage of
allowing a high degree of uncertainty to be represented in a database but at the expense of making
maintenance of integrity constraints much more difficult. In contrast, the strong satisfaction approach
restricts the amount of uncertainty that can be represented in a database but makes the maintenance of
integrity constraints much easier. However, as argued in [?], both approaches have their place in real
world applications and should be viewed as complementary rather than competing approaches. Also, it
is possible to combine the two approaches by having some FDs in a relation strongly satisfied and others
weakly satisfied [17].
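
The difference between the two notions can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: it assumes a relation is a list of dictionaries, that `None` plays the role of a null, and that completions draw their values from a small finite `domain` passed in by the caller (real domains are typically infinite). All function names are hypothetical.

```python
from itertools import product

NULL = None  # hypothetical marker for a missing value


def satisfies(rows, lhs, rhs):
    """Classical FD check on a complete relation: equal lhs values force equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def completions(rows, domain):
    """Yield every relation obtained by replacing each null with a value from `domain`."""
    slots = [(i, a) for i, row in enumerate(rows) for a, v in row.items() if v is NULL]
    for values in product(domain, repeat=len(slots)):
        filled = [dict(row) for row in rows]
        for (i, a), v in zip(slots, values):
            filled[i][a] = v
        yield filled


def weakly_satisfies(rows, lhs, rhs, domain):
    # At least one completion satisfies the FD.
    return any(satisfies(c, lhs, rhs) for c in completions(rows, domain))


def strongly_satisfies(rows, lhs, rhs, domain):
    # Every completion satisfies the FD.
    return all(satisfies(c, lhs, rhs) for c in completions(rows, domain))


r = [{"A": 1, "B": NULL}, {"A": 1, "B": 2}]
print(weakly_satisfies(r, ["A"], "B", [1, 2]))
print(strongly_satisfies(r, ["A"], "B", [1, 2]))
```

In this toy relation the FD A -> B is weakly satisfied (fill the null with 2) but not strongly satisfied (filling it with 1 produces a violation).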

The contribution of [3] was, for the first time, to define FDs in XML (what we call XFDs) and then
to define a normal form for a XML document based on the definition of a XFD. However, there are some
difficulties with the definition of a XFD given in [3]. The most fundamental problem is that although it is
explicitly recognized in the definitions that XML documents have missing information, the definitions in
[3], while having some elements of the weak instance approach, are not a strict extension of this approach
since there are XFDs that are violated according to the definition in [3] yet there are completions of the
tree that satisfy the XFDs (see [?] for an example). As a result of this it is not clear that there is any
correspondence between FDs in relations and XFDs in XML documents. The other difficulty is that the
approach to defining XFDs is not straightforward and is based on the complex and non-intuitive notion
of a "tree tuple".

In [22] a different approach was taken to defining XFDs which overcomes the difficulties just discussed
with the approach adopted in [3]. The definition in [22] is based on extending the strong satisfaction
approach to XML. The definition of a XFD given in [22] was justified formally by two main results. The
first result showed that for a very general class of mappings from an incomplete relation into a XML
document, a relation strongly satisfies a unary FD (only one attribute on the l.h.s. of the FD) if and only
if the corresponding XML document strongly satisfies the corresponding XFD. The second result showed
that a XML document strongly satisfies a XFD if and only if every completion of the XML document also
satisfies the XFD. The other contributions in [22] were firstly to define a set of axioms for reasoning about
the implication of XFDs and show that the axioms are sound for arbitrary XFDs. The final contribution
was to define a normal form, based on a modification of the one proposed in [3], and prove that it is a
necessary and sufficient condition for the elimination of redundancy in a XML document.

The contribution of this paper is to extend the work in [22] in two important ways. As just mentioned,
in [22] a set of axioms for XFDs was provided and shown to be sound. In this paper we prove that the
axioms are also complete for unary XFDs. The second contribution of the paper is to prove that the
implication problem for unary XFDs is decidable and to provide a linear time algorithm for it. These
results have considerable significance in the development of a theory of normalization for XML documents.
In relational databases, the classic results on soundness and completeness of Armstrong's axioms [?] and
the resulting closure algorithm for FD implication play an essential role in determining whether a relation
is in one of the classic normal forms. Similarly, the results in this paper are an important first step in
the development of algorithms for testing the normal form proposed in [22]. In addition, the result on
completeness is of theoretical interest in itself since it ensures that there are no other 'hidden' axioms for
reasoning about the implication of XFDs.

The rest of this paper is organized as follows. Section 2 contains some preliminary definitions. In 
Section 3 a XFD is defined. In Section 4 axioms for XFDs are presented and are shown to be sound for 
arbitrary XFDs and complete for unary XFDs. In Section 5 the implication problem for unary XFDs 
is investigated and a linear time algorithm for the implication problem is presented and shown to be 
correct. Finally, Section 6 contains concluding comments. 

2 Preliminary definitions 

In this section we present some preliminary definitions that we need before defining XFDs. We firstly
present the definition of a XML tree adapted from the definition given in [8].

Definition 1 Assume a countably infinite set $\mathbf{E}$ of element labels (tags), a countably infinite set $\mathbf{A}$ of
attribute names and a symbol $\mathbf{S}$ indicating text. An XML tree is defined to be $T = (V, lab, ele, att, val, v_r)$
where $V$ is a finite set of nodes in $T$; $lab$ is a function from $V$ to $\mathbf{E} \cup \mathbf{A} \cup \{\mathbf{S}\}$; $ele$ is a partial function
from $V$ to a sequence of $V$ nodes such that for any $v \in V$, if $ele(v)$ is defined then $lab(v) \in \mathbf{E}$; $att$ is a
partial function from $V \times \mathbf{A}$ to $V$ such that for any $v \in V$ and $l \in \mathbf{A}$, if $att(v, l) = v_1$ then $lab(v) \in \mathbf{E}$
and $lab(v_1) = l$; $val$ is a function such that for any node $v \in V$, $val(v) = v$ if $lab(v) \in \mathbf{E}$ and $val(v)$ is
a string if either $lab(v) = \mathbf{S}$ or $lab(v) \in \mathbf{A}$; $v_r$ is a distinguished node in $V$ called the root of $T$ and we
define $lab(v_r) = root$. Since node identifiers are unique, a consequence of the definition of $val$ is that if
$lab(v_1) \in \mathbf{E}$ and $lab(v_2) \in \mathbf{E}$ and $v_1 \neq v_2$ then $val(v_1) \neq val(v_2)$. We also extend the definition of $val$ to sets of
nodes and if $V_1 \subseteq V$, then $val(V_1)$ is the set defined by $val(V_1) = \{val(v) \mid v \in V_1\}$.

For any $v \in V$, if $ele(v)$ is defined then the nodes in $ele(v)$ are called subelements of $v$. For any $l \in \mathbf{A}$,
if $att(v, l) = v_1$ then $v_1$ is called an attribute of $v$. Note that a XML tree $T$ must be a tree. Since $T$ is a
tree the ancestors of a node $v$, denoted by $Ancestor(v)$, are defined as in Definition 1. The children of a
node $v$ are also defined as in Definition 1 and we denote the parent of a node $v$ by $Parent(v)$.

We note that our definition of val differs slightly from that in [8] since we have extended the
definition of the val function so that it is also defined on element nodes. The reason for this is that we 
want to include in our definition paths that do not end at leaf nodes, and when we do this we want to 
compare element nodes by node identity, i.e. node equality, but when we compare attribute or text nodes 
we want to compare them by their contents, i.e. value equality. This point will become clearer in the 
examples and definitions that follow. 




Figure 1: A XML tree 

We now give some preliminary definitions related to paths. 

Definition 2 A path is an expression of the form $l_1.\cdots.l_n$, $n \geq 1$, where $l_i \in \mathbf{E} \cup \mathbf{A} \cup \{\mathbf{S}\}$ for all
$i$, $1 \leq i \leq n$, and $l_1 = root$. If $p$ is the path $l_1.\cdots.l_n$ then $Last(p)$ is $l_n$.

For instance, in Figure 1, root and root.Division are paths.

Definition 3 Let $p$ denote the path $l_1.\cdots.l_n$. The function $Parnt(p)$ is the path $l_1.\cdots.l_{n-1}$. Let $p$
denote the path $l_1.\cdots.l_n$ and let $q$ denote the path $q_1.\cdots.q_m$. The path $p$ is said to be a prefix of the
path $q$ if $n \leq m$ and $l_1 = q_1, \ldots, l_n = q_n$. Two paths $p$ and $q$ are equal, denoted by $p = q$, if $p$ is a prefix
of $q$ and $q$ is a prefix of $p$. The path $p$ is said to be a strict prefix of $q$ if $p$ is a prefix of $q$ and $p \neq q$.
We also define the intersection of two paths $p_1$ and $p_2$, denoted by $p_1 \cap p_2$, to be the maximal common
prefix of both paths. It is clear that the intersection of two paths is also a path.

For example, in Figure 1, root.Division is a strict prefix of root.Division.Section and
root.Division.d# $\cap$ root.Division.Employee.Emp#.S = root.Division.
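
The path operations of Definitions 2 and 3 are simple enough to sketch directly. The following is a minimal illustration, assuming a path is represented as a tuple of labels; the helper names (`path`, `is_prefix`, `is_strict_prefix`, `parnt`, `intersect`) are chosen here for illustration and are not from the paper.

```python
def path(s):
    """Parse a dotted path such as 'root.Division.Section' into a tuple of labels."""
    return tuple(s.split("."))


def is_prefix(p, q):
    """p is a prefix of q if q starts with all the labels of p."""
    return len(p) <= len(q) and q[:len(p)] == p


def is_strict_prefix(p, q):
    return is_prefix(p, q) and p != q


def parnt(p):
    """Parnt(p): the path with its last label removed."""
    return p[:-1]


def intersect(p, q):
    """The intersection of p and q: their maximal common prefix."""
    common = []
    for a, b in zip(p, q):
        if a != b:
            break
        common.append(a)
    return tuple(common)


print(is_strict_prefix(path("root.Division"), path("root.Division.Section")))
print(".".join(intersect(path("root.Division.d#"),
                         path("root.Division.Employee.Emp#.S"))))
```

The second call reproduces the intersection example given in the text.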

Definition 4 A path instance in a XML tree $T$ is a sequence $v_1.\cdots.v_n$ such that $v_1 = v_r$ and for all
$v_i$, $1 < i \leq n$, $v_i \in V$ and $v_i$ is a child of $v_{i-1}$. A path instance $v_1.\cdots.v_n$ is said to be defined over the
path $l_1.\cdots.l_n$ if for all $v_i$, $1 \leq i \leq n$, $lab(v_i) = l_i$. Two path instances $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ are said to
be distinct if $v_i \neq v'_i$ for some $i$, $1 \leq i \leq n$. The set of path instances over a path $p$ in a tree $T$ is denoted
by $Paths(p)$.

Definition 5 An extended XML tree is a tree $(V \cup N, lab, ele, att, val, v_r)$ where $N$ is a set of marked
nulls that is disjoint from $V$ and if $v \in N$ and $lab(v) \notin \mathbf{E}$ then $val(v)$ is undefined.

Definition 6 Let T be a XML tree and let P be a set of paths. Then (T, P) is consistent if: 

(i) For any two paths $l_1.\cdots.l_n$ and $l'_1.\cdots.l'_m$ in $P$ such that $l'_m = l_i$ for some $i$, $1 \leq i \leq n$, then
$l_1.\cdots.l_i = l'_1.\cdots.l'_m$;

(ii) If $v_1$ and $v_2$ are two nodes in $T$ such that $v_1$ is the parent of $v_2$, then there exists a path $l_1.\cdots.l_n$
in $P$ such that there exist $i$ and $j$, where $1 \leq i \leq n$ and $1 \leq j \leq n$ and $i < j$ and $lab(v_1) = l_i$ and
$lab(v_2) = l_j$.

Definition 7 Let $T$ be a XML tree and let $P$ be a set of paths such that $(T, P)$ is consistent. Then
a minimal extension of $T$, denoted by $T_P$, is an extended XML tree constructed as follows. Initially let
$T_P$ be $T$. Process each path $p$ in $P$ in an arbitrary order as follows. For every node $v$ in $T$ such
that $lab(v)$ appears in $p$ and there does not exist a path instance containing $v$ which is defined over $p$,
construct a path instance over $p$ by adding nodes from $N$ as ancestors and descendants of $v$.

The next lemma follows easily from the construction procedure. 

Lemma 1 $T_P$ is unique up to the labelling of the null nodes.

For instance, the minimal extension of the tree in Figure 1 is shown in Figure 2.

Definition 8 A path instance $v_1.\cdots.v_n$ in $T$ is defined to be complete if $v_1.\cdots.v_n \in T_P$. A tree $T$ is
defined to be complete w.r.t. a set of paths $P$ if $(T, P)$ is consistent and $T = T_P$. Also we often do not
need to distinguish between nulls and so the statement $v = \perp$ is shorthand for $\exists j (v = \perp_j)$ and $v \neq \perp$ is
shorthand for $\nexists j (v = \perp_j)$.

The next function returns all the final nodes of the path instances of a path p.

Figure 2: The minimal extension of a XML tree.

Definition 9 Let $T_P$ be the minimal extension of $T$. The function $N(p)$, where $p$ is the path $l_1.\cdots.l_n$,
is defined to be the set $\{v \mid v_1.\cdots.v_n \in Paths(p) \wedge v = v_n\}$.

For example, in Figure 2, N(root.Division.Section.Employee) = {utjWs} and
N(root.Division.Section) = $\{v_4, \perp_1\}$.

We now need to define a function that is related to ancestor. 

Definition 10 Let $T_P$ be the minimal extension of $T$. The function $AAncestor(v, p)$, where $v \in V \cup N$,
$p$ is a path and $v \in N(p)$, is defined by $AAncestor(v, p) = \{v' \mid v' \in \{v'_1, \cdots, v'_n\} \wedge v = v'_n \wedge v'_1.\cdots.v'_n \in Paths(p)\}$.

For example, in Figure 2, AAncestor($v_5$, root.Division.Section.Employee) = $\{v_r, v_2, \perp_1, v_5\}$. The
next function returns all nodes that are the final nodes of path instances of $p$ and are descendants of $v$.

Definition 11 Let $T_P$ be the minimal extension of $T$. The function $Nodes(v, p)$, where $v \in V \cup N$ and $p$
is a path, is the set defined by $Nodes(v, p) = \{x \mid x \in N(p) \wedge v \in AAncestor(x, p)\}$. Note that $Nodes(v, p)$
may be empty.

For example, in Figure 2, Nodes($v_r$, root.Division.Section.Employee) = {v^,vy},
Nodes($v_1$, root.Division.Section.Employee) = {vj}, Nodes($v_7$, root.Division) = $\emptyset$.
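
To make the functions of Definitions 4 and 9 to 11 concrete, here is a minimal sketch over a small hand-built tree. It is an illustration under stated assumptions, not the paper's implementation: the `Node` class, the node names and the toy tree are hypothetical (the tree is not the one in Figure 2), and null nodes are not modelled.

```python
class Node:
    """A tree node with a label; `name` is only used to print results."""
    def __init__(self, name, label, parent=None):
        self.name, self.label, self.parent = name, label, parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def path_instances(root, p):
    """All node sequences starting at the root whose labels spell out the path p."""
    if not p or root.label != p[0]:
        return []
    instances = [[root]]
    for label in p[1:]:
        instances = [inst + [c] for inst in instances
                     for c in inst[-1].children if c.label == label]
    return instances


def N(root, p):
    """Final nodes of the path instances of p."""
    return {inst[-1] for inst in path_instances(root, p)}


def a_ancestor(root, v, p):
    """Nodes on a path instance of p that ends at v (v itself included)."""
    return {u for inst in path_instances(root, p) if inst[-1] is v for u in inst}


def nodes(root, v, p):
    """Final nodes of p-instances that pass through v; may be empty."""
    return {x for x in N(root, p) if v in a_ancestor(root, x, p)}


def names(s):
    return {n.name for n in s}


vr = Node("vr", "root")
v1, v2 = Node("v1", "Division", vr), Node("v2", "Division", vr)
v3, v6 = Node("v3", "Section", v1), Node("v6", "Section", v2)
v4, v5 = Node("v4", "Employee", v3), Node("v5", "Employee", v3)
v7 = Node("v7", "Employee", v6)

EMP = ("root", "Division", "Section", "Employee")
print(names(N(vr, EMP)))
print(names(a_ancestor(vr, v7, EMP)))
print(names(nodes(vr, v1, EMP)))
print(names(nodes(vr, v7, ("root", "Division"))))
```

The last call returns the empty set because the Employee node `v7` is not on any path instance of root.Division, mirroring the remark that Nodes(v, p) may be empty.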

Definition 12 The partial ordering $>$ on the set of nodes $V$ in a XML tree $T$ is defined by $v_1 > v_2$ iff
$v_2 \in Ancestor(v_1)$, where $v_1$ and $v_2$ are in $V$.

In a similar fashion, we define a partial ordering on paths as follows. 

Definition 13 The partial ordering $>$ on a set of paths $P$ is defined by $p_2 > p_1$ if $p_1$ is a prefix of $p_2$,
where $p_1$ and $p_2$ are paths in $P$.

For example, in Figure 2, root.Division.D# $>$ root.Division. Also, root.Division.D# and
root.Division.Section are incomparable.

Lastly we extend the definition of the val function so that $val(\perp_j) = \perp_j$. Note that different unmarked
nulls are not considered to be equal and so $val(\perp_i) \neq val(\perp_j)$ if $i \neq j$.

3 Strong Functional Dependencies in XML 

This leads us to the main definition of our paper. 

Definition 14 Let $T$ be a XML tree and let $P$ be a set of paths such that $(T, P)$ is consistent. A XML
functional dependency (XFD) is a statement of the form $p_1, \cdots, p_k \rightarrow q$ where $p_1, \cdots, p_k$ and $q$ are
paths in $P$. $T$ strongly satisfies the XFD if $p_i = q$ for some $i$, $1 \leq i \leq k$, or for any two distinct path
instances $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in $Paths(q)$ in $T_P$,
$((v_n = \perp \wedge v'_n = \perp) \vee (v_n \neq \perp \wedge v'_n = \perp) \vee (v_n = \perp \wedge v'_n \neq \perp) \vee (v_n \neq \perp \wedge v'_n \neq \perp \wedge val(v_n) \neq val(v'_n)))$
$\Rightarrow \exists i$, $1 \leq i \leq k$, such that $x_i \neq y_i$ if $Last(p_i) \in \mathbf{E}$,
else $\perp \notin Nodes(x_i, p_i)$ and $\perp \notin Nodes(y_i, p_i)$ and $val(Nodes(x_i, p_i)) \cap val(Nodes(y_i, p_i)) = \emptyset$, where
$x_i = \{v \mid v \in \{v_1, \cdots, v_n\} \wedge v \in N(p_i \cap q)\}$ and $y_i = \{v \mid v \in \{v'_1, \cdots, v'_n\} \wedge v \in N(p_i \cap q)\}$.

We note that since the path $p_i \cap q$ is a prefix of $q$, there always exists one and only one node in
$\{v_1, \cdots, v_n\}$ that is also in $N(p_i \cap q)$ and so $x_i$ is always defined and unique. Similarly for $y_i$.

We now outline the thinking behind the above definition firstly for the simplest case where the l.h.s.
of the XFD contains a single path. In the relational model, if we are given a relation $r$ and a FD $A \rightarrow B$,
then to see if $A \rightarrow B$ is satisfied we have to check the $B$ values and their corresponding $A$ values. In the
relational model the correspondence between $B$ values and $A$ values is obvious: the $A$ value corresponding
to a $B$ value is the $A$ value in the same tuple as the $B$ value. However, in XML there is no concept of
a tuple so it is not immediately clear how to generalize the definition of an FD to XML. Our solution
is based on the following observation. In a relation $r$ with tuple $t$, the value $t[A]$ can be seen as the
'closest' $A$ value to the $B$ value $t[B]$. In Definition 14 we generalize this observation and given a path
instance $v_1.\cdots.v_n$ in $Paths(q)$, we first compute the 'closest' ancestor of $v_n$ that is also an ancestor of a
node in $N(p)$ ($x_1$ in the above definition) and then compute the 'closest $p$-nodes' to be the set of nodes
which terminate a path instance of $p$ and are descendants of $x_1$. We then proceed in a similar fashion
for the other path $v'_1.\cdots.v'_n$ and compute the '$p$-nodes' which are closest to $v'_n$. We note that in this
definition, as opposed to the relational case, there will be in general more than one 'closest $p$-node'
and so $Nodes(x_1, p)$ and $Nodes(y_1, p)$ will in general contain more than one node. Having computed the
'closest $p$-nodes' to $v_n$ and $v'_n$, if $val(v_n) \neq val(v'_n)$ we then require, generalizing on the relational case,
that the val's of the sets of corresponding 'closest $p$-nodes' be disjoint.

The rationale for the case where there is more than one path on the l.h.s. is similar. Given a XFD
$p_1, \cdots, p_k \rightarrow q$ and two paths $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in $Paths(q)$ which end in nodes with different val,
we firstly compute, for each $p_i$, the set of 'closest $p_i$ nodes' to $v_n$ in the same fashion as just outlined.
Then extending the relational approach to FD satisfaction, we require that in order for $p_1, \cdots, p_k \rightarrow q$
to be satisfied there is at least one $p_i$ for which the val's of the set of 'closest $p_i$ nodes' to $v_n$ is disjoint
from the val's of the set of 'closest $p_i$ nodes' to $v'_n$. We now illustrate the definition by some examples.
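
The 'closest nodes' idea can be sketched for the simplest setting: a unary XFD p -> q checked on a tree with no missing information, so that the clauses about nulls never apply. The code below is an illustrative sketch only. The `Node` class, the `element_labels` parameter (standing in for the set of element labels) and the toy Department tree are assumptions made here, not part of the paper.

```python
from itertools import combinations


class Node:
    def __init__(self, label, parent=None, text=None):
        # text is None for element nodes; attribute and text nodes carry a string
        self.label, self.parent, self.text = label, parent, text
        self.children = []
        if parent is not None:
            parent.children.append(self)


def val(v):
    """Element nodes are compared by identity, other nodes by their string content."""
    return v if v.text is None else v.text


def path_instances(root, p):
    if not p or root.label != p[0]:
        return []
    insts = [[root]]
    for label in p[1:]:
        insts = [i + [c] for i in insts for c in i[-1].children if c.label == label]
    return insts


def common_prefix(p, q):
    n = 0
    while n < min(len(p), len(q)) and p[n] == q[n]:
        n += 1
    return p[:n]


def nodes(root, x, p):
    """Final nodes of the p-instances that pass through x (the 'closest p-nodes')."""
    return {i[-1] for i in path_instances(root, p) if x in i}


def strongly_satisfies(root, p, q, element_labels):
    """Check the unary XFD p -> q on a tree assumed to contain no nulls."""
    if p == q:
        return True
    k = len(common_prefix(p, q))  # x and y are the k-th nodes of the q-instances
    for a, b in combinations(path_instances(root, q), 2):
        if val(a[-1]) == val(b[-1]):
            continue  # only q-nodes with different val are constrained
        x, y = a[k - 1], b[k - 1]
        if p[-1] in element_labels:
            ok = x is not y
        else:
            ok = not ({val(n) for n in nodes(root, x, p)} &
                      {val(n) for n in nodes(root, y, p)})
        if not ok:
            return False
    return True


root = Node("root")
d1, d2 = Node("Department", root), Node("Department", root)
h1, h2 = Node("Head", d1, text="h1"), Node("Head", d2, text="h2")
HEAD, DEPT = ("root", "Department", "Head"), ("root", "Department")
E = {"root", "Department"}
print(strongly_satisfies(root, HEAD, DEPT, E))
h2.text = "h1"
print(strongly_satisfies(root, HEAD, DEPT, E))
```

With distinct Head values the dependency root.Department.Head -> root.Department holds; once both departments share the value "h1" it is violated, since two different Department nodes then have overlapping closest Head values.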

Example 1 Consider the XML tree shown in Figure 3 and the XFD
root.Department.Lecturer.Subject.Subject# $\rightarrow$ root.Department.Lecturer.Subject.SubjName.S.
Then $v_r.v_1.v_5.v_{13}.v_{17}.v_{22}$ and $v_r.v_2.v_9.v_{15}.v_{21}.v_{24}$ are two distinct path instances in
Paths(root.Department.Lecturer.Subject.SubjName.S) and $val(v_{22})$ = "n1" and $val(v_{24})$ = "n2".
So N(root.Department.Lecturer.Subject.Subject# $\cap$ root.Department.Lecturer.Subject.SubjName.S)
= $\{v_{13}, v_{14}, v_{15}\}$ and so $x_1 = v_{13}$ and $y_1 = v_{15}$.
Thus val(Nodes($x_1$, root.Department.Lecturer.Subject.Subject#)) = {"s1"} and
val(Nodes($y_1$, root.Department.Lecturer.Subject.Subject#)) = {"s2"}. Similarly for the paths
Vr.vi.VQ.vi3.vi4^.V2'i and $v_r.v_2.v_9.v_{15}.v_{21}.v_{24}$, and so the XFD is satisfied. We note that if we change val of
node $v_{18}$ in Figure 3 to "s1" then the XFD is violated.

Consider next the XFD root.Department.Head $\rightarrow$ root.Department. Then $v_r.v_1$ and $v_r.v_2$ are two
distinct path instances in Paths(root.Department) and $val(v_1) = v_1$ and $val(v_2) = v_2$. Also
N(root.Department.Head $\cap$ root.Department) = $\{v_1, v_2\}$ and so $x_1 = v_1$ and $y_1 = v_2$. Thus
val(Nodes($x_1$, root.Department.Head)) = {"h1"} and val(Nodes($y_1$, root.Department.Head)) = {"h2"}
and so the XFD is satisfied. We note that if we change val of node $v_8$ in Figure 3 to "h1" then the XFD
is violated.

Consider next the XFD root.Department.Lecturer.Lname, root.Department.Dname $\rightarrow$
root.Department.Lecturer.Subject.Subject#. Then $v_r.v_1.v_5.v_{13}.v_{16}$ and $v_r.v_2.v_9.v_{15}.\perp_1$ are two
distinct path instances in Paths(root.Department.Lecturer.Subject.Subject#) and $val(v_{16})$ = "s1"
and the final node in $v_r.v_2.v_9.v_{15}.\perp_1$ is null.

Then N(root.Department.Lecturer.Lname $\cap$ root.Department.Lecturer.Subject.Subject#) =
$\{v_5, v_6, v_9\}$ and so $x_1 = v_5$ and $y_1 = v_9$ and so val(Nodes($x_1$, root.Department.Lecturer.Lname)) =
"l1" and val(Nodes($y_1$, root.Department.Lecturer.Lname)) = "l1". We then compute
N(root.Department.Dname $\cap$ root.Department.Lecturer.Subject.Subject#) = $\{v_1, v_2\}$ and so
$x_2 = v_1$ and $y_2 = v_2$ and so val(Nodes($x_2$, root.Department.Dname)) = "d1" and
val(Nodes($y_2$, root.Department.Dname)) = "d2". Similarly, for the paths $v_r.v_1.v_6.v_{14}.v_{18}$, we derive
that $x_2 = v_6$ and $y_2 = v_9$ and so
val(Nodes($x_2$, root.Department.Dname)) $\neq$ val(Nodes($y_2$, root.Department.Dname)) = "d2" and
so the XFD is satisfied. Thus $x_1 \neq y_1$. Similarly for $v_r.v_1.v_6$ and $v_r.v_2.v_9.v_{15}.\perp_1$ and so the XFD is
satisfied.

Figure 3: A XML tree illustrating the definition of a XFD

4 Axiomatization for XFDs 

In this section we address the issue of completeness of the axiom system for reasoning about implication
of XFDs that was presented in [22]. The axiom system is the following.

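Axiom A1. $p_1, \cdots, p_k \rightarrow p_i$ for any $p_i$, $1 \leq i \leq k$.

Axiom A2. If $p_1, \cdots, p_k \rightarrow q$, then $p, p_1, \cdots, p_k \rightarrow q$ for any path $p$.

Axiom A3. If $p_1, \cdots, p_k \rightarrow q$ and $q \rightarrow s$ then $p_1, \cdots, p_k \rightarrow s$.

Axiom A4. If $p_1, \cdots, p_k \rightarrow q$ and $\forall i$, $1 \leq i \leq k$, $p_i \cap q = root$, then $p \rightarrow q$ for any path $p$.

Axiom A5. If $p \rightarrow q$ then $p' \rightarrow q$ for all paths $p'$ such that $p \cap q$ is a prefix of $p'$ and either $p'$ is a
prefix of $p$ or $p'$ is a prefix of $q$.

Axiom A6. If $Last(p) \in \mathbf{E}$ and $q$ is a prefix of $p$ then $p \rightarrow q$.

Axiom A7. If $Last(q) \in \mathbf{A}$ then $Parnt(q) \rightarrow q$.

Axiom A8. $p \rightarrow root$ for any path $p$.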

Theorem 1 Axioms A1 - A8 are sound for implication of arbitrary XFDs.

Proof. For the sake of the completeness of this paper, the proof from [22] is reproduced in the Appendix.

We now illustrate these axioms by an example. 

Example 2 Consider the XML tree shown in Figure 4 and the set $\Sigma$ of XFDs {root.A.B.C.C# $\rightarrow$
root.A.D.E, root.A.D.E $\rightarrow$ root.A.D.E.F.F#, root.A $\rightarrow$ root.G}. It can be easily verified that the
XML tree in Figure 4 satisfies $\Sigma$. Then from $\Sigma$ and the axioms we can deduce that the following XFDs
Figure 4: XML tree illustrating axioms for XFDs. 

are implied by $\Sigma$¹: from A1 we can derive root.A $\rightarrow$ root.A, from A2 and root.A $\rightarrow$ root.G we can
derive that root.A, root.A.B.C $\rightarrow$ root.G, from A3 and root.A.B.C.C# $\rightarrow$ root.A.D.E and root.A.D.E
$\rightarrow$ root.A.D.E.F.F# we can derive that root.A.B.C.C# $\rightarrow$ root.A.D.E.F.F#, from A4 and root.A
$\rightarrow$ root.G we can derive that root.A.D.E $\rightarrow$ root.G, from A5 and root.A.B.C.C# $\rightarrow$ root.A.D.E we can
derive that root.A.B $\rightarrow$ root.A.D.E and that root.A.D $\rightarrow$ root.A.D.E, from A6 we can derive that
root.A.D.E $\rightarrow$ root.A, from A7 we can derive that root.A.D.E.F $\rightarrow$ root.A.D.E.F.F# and from A8
we derive that root.A.D $\rightarrow$ root.
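
Axiom A5 is the least familiar of the eight rules, so a small sketch may help. Assuming paths are tuples of labels, the hypothetical helper `a5_left_sides` below lists every path p' for which A5 lets us conclude p' -> q from a given p -> q; the function names are illustrative and not taken from the paper.

```python
def path(s):
    return tuple(s.split("."))


def is_prefix(p, q):
    return q[:len(p)] == p


def intersect(p, q):
    n = 0
    while n < min(len(p), len(q)) and p[n] == q[n]:
        n += 1
    return p[:n]


def a5_left_sides(p, q):
    """All p' such that (p intersect q) is a prefix of p' and p' is a prefix of p or of q."""
    base = intersect(p, q)
    prefixes = ({p[:i] for i in range(1, len(p) + 1)} |
                {q[:i] for i in range(1, len(q) + 1)})
    return sorted(x for x in prefixes if is_prefix(base, x))


for lhs in a5_left_sides(path("root.A.B.C.C#"), path("root.A.D.E")):
    print(".".join(lhs), "->", "root.A.D.E")
```

Applied to root.A.B.C.C# -> root.A.D.E, the list includes root.A.B and root.A.D, the two left-hand sides derived with A5 in the example above. It also includes root.A.D.E itself, which gives a trivial dependency.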

This now leads to the first major result of the paper. 

Theorem 2 Axioms A1 - A8 are complete for unary XFDs.

Proof. See Appendix. 



5 Decidability Of Implication for Unary XFDs 

In this section we derive the second main result of the paper by showing that the implication problem
for unary XFDs is decidable. We do this by constructing an algorithm for generating $P^+$, the set of all
paths $q$ such that $q \in P^+$ if and only if $p \rightarrow q \in \Sigma^+$, and then prove that the algorithm is correct. We
note also that the running time of the algorithm is linear in the number of XFDs in $\Sigma$. Firstly we present
an algorithm which is analogous to the classic chase procedure for relations [19].
Before presenting the next algorithm, we define two functions.

Definition 15 The function $Anc(p)$, where $p$ is a path, is the set defined by $Anc(p) = \{q \mid q$ is a strict
prefix of $p\}$. The function $Att(p)$ is the set defined by $Att(p) = \{q \mid p = Parnt(q) \wedge Last(q) \in \mathbf{A}\}$.

¹ We do not show all the XFDs that can be derived from the axioms.

Algorithm 1

INPUT: A set $\Sigma$ of unary XFDs and a tree $T$ which is complete w.r.t. the
       set of paths in $\Sigma$.
OUTPUT: A XML tree $\bar{T}$ satisfying the set of XFDs and which is complete
       w.r.t. the set of paths in $\Sigma$.

$\bar{T}$ = $T$;
Repeat until no more changes can be made to $\bar{T}$
    For each $p \rightarrow q \in \Sigma$ do
        If $Last(q) \notin \mathbf{E}$ then
            If there exist v1, v2, v3, v4 in $\bar{T}$ such that
              v1, v2 $\in N(p)$, v3, v4 $\in N(q)$ and val(v1) = val(v2) and
              val(v3) < val(v4) then
                val(v4) := val(v3);
        If $Last(q) \in \mathbf{E}$ then
            If there exist v1, v2, v3, v4 in $\bar{T}$ such that
              v1, v2 $\in N(p)$, v3, v4 $\in N(q)$ and val(v1) = val(v2) then
                attach all descendants of v4 to v3;
                DeleteSameAtts(v3);
                vl := Parent(v3); vr := Parent(v4);
                repeat until vl = vr
                    attach all descendants of vr to vl except for v4;
                    DeleteSameAtts(vl);
                    delete(v4);
                    v4 := vr;
                    vl := Parent(vl); vr := Parent(vr);
                endrepeat;
    endfor
endrepeat

procedure DeleteSameAtts(node v);
    For any pair of nodes v5 and v6 such that v5 and v6 are children
    of v and lab(v5) = lab(v6) and lab(v5) $\in \mathbf{A}$ and
    val(v5) < val(v6) then delete v6;
return;

We now illustrate Algorithm 1 by an example.

Example 3 Let $\Sigma$ be the set of XFDs {root.A.A# $\rightarrow$ root.A.B.B#, root.A.B.B# $\rightarrow$ root.A.B.C.C#,
root.A.B $\rightarrow$ root.A.B.D} and let the initial tree $T$ be as shown in Figure 5. Then if we apply the XFD
root.A.A# $\rightarrow$ root.A.B.B# the resulting tree is shown in Figure 6. If we then apply root.A.B.B# $\rightarrow$
root.A.B.C.C# the resulting tree is shown in Figure 7. Finally, if we apply root.A.B $\rightarrow$ root.A.B.D
then the tree is shown in Figure 8.

Figure 5: Initial XML tree

Figure 6: XML tree after applying root.A.A# $\rightarrow$ root.A.B.B#




Figure 7: XML tree after applying root.A.B.B# $\rightarrow$ root.A.B.C.C#

Lemma 2 Algorithm 1 always terminates.

Figure 8: XML tree after applying root.A.B $\rightarrow$ root.A.B.D



Proof. The function $Count(p)$, where $p$ is a path and $Last(p) \in \mathbf{E}$, is defined to be $|N(p)|$, where $|\cdot|$
denotes cardinality. The function $Sum(p)$, where $Last(p) \notin \mathbf{E}$, is defined as follows. For any text string value $t$
define the function $int(t)$ to be the integer value obtained by considering $t$ to be an integer to base 256
and then define $Sum(p)$ to be $\sum_{v \in N(p)} int(val(v))$. At each iteration of the repeat loop either $Count(p)$
or $Sum(p)$ strictly decreases for at least one path $p$. Hence since both $Count(p)$ and $Sum(p)$ are
bounded below by 0, Algorithm 1 must terminate. □
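
The measure used in this termination argument can be illustrated numerically. The sketch below assumes strings are encoded one byte per character (Latin-1 is an assumption made here so that each character is a base-256 digit), and the names `int_base256` and `sum_p` are hypothetical. It shows only that overwriting a larger val with a smaller one strictly decreases the sum.

```python
def int_base256(t):
    """Read the string t as an integer written in base 256, one byte per character."""
    return int.from_bytes(t.encode("latin-1"), "big")


def sum_p(values):
    """Sum(p): the sum of int(val(v)) over the end nodes of the path p."""
    return sum(int_base256(v) for v in values)


before = ["s1", "s2"]   # val of two end nodes before a chase step
after = ["s1", "s1"]    # val(v4) := val(v3), where val(v3) < val(v4)
print(sum_p(before), sum_p(after))
```

Because the sum is a non-negative integer, it can only decrease finitely many times, which is the bound the proof relies on.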

Firstly, let us denote by $P_\Sigma$ the set of paths that appear on the l.h.s. or r.h.s. of any XFD in a set of
unary XFDs $\Sigma$.

Lemma 3 The tree $\bar{T}$ produced by Algorithm 1 is complete w.r.t. $P_\Sigma$.

Proof. The proof is by induction. Initially the result is true because of the restriction placed on the
input tree $T$ by Algorithm 1. Assume then the result is true after iteration $k - 1$. Then during iteration
$k$, the only path instances which can possibly be changed are those in $Paths(q)$ or $Paths(Anc(q))$ for
some path $q$ appearing on the r.h.s. of an XFD in $\Sigma$. However, if we merge two nodes in $N(q)$ then we also
merge their ancestor nodes and so after iteration $k$, $Paths(q)$ and $Paths(Anc(q))$ will again contain only
complete paths and so the result is established. □

Lemma 4 The tree generated by Algorithm 1 satisfies $\Sigma$.

Proof. From the definition of the algorithm, the algorithm terminates only when there is no XFD
that is violated. □

Next, we introduce an algorithm for calculating the closure of a set of XFDs. 






Algorithm 2

INPUT: A set $\Sigma$ of unary XFDs and a path $p$.
OUTPUT: $P^+$, the set of paths such that $q \in P^+$ iff
        $p \rightarrow q$ is implied by $\Sigma$.

1:  $P^+$ = {$p$, root};
2:  If $Last(p) \in \mathbf{E}$ then $P^+ = P^+ \cup Anc(p)$;
3:  for each $q \in P^+$ do $P^+ := P^+ \cup Att(q)$;
    Unused = $\Sigma$;
    repeat until no more changes to $P^+$
        Choose arbitrarily $r \rightarrow s$ from $\Sigma$;
4:      If ($\exists p_1 \in P^+$ such that $r \cap s$ is a prefix of $p_1$ and
            either $p_1$ is a prefix of $r$ or $p_1$ is a prefix of $s$)
5:         $\vee$ ($r \in P^+$)
6:         $\vee$ ($r \cap s = root$) then
            $P^+ = P^+ \cup \{s\}$;
            Unused = Unused $-$ {$r \rightarrow s$};
7:          if $Last(s) \in \mathbf{E}$ then $P^+ = P^+ \cup Anc(s) \cup Att(s)$;
    endrepeat

We note that it is easily seen that since each XFD in $\Sigma$ is used only once, the running time of
Algorithm 2 is linear in the number of XFDs in $\Sigma$. We now proceed to prove that Algorithm 2 is correct.
Two cases are considered separately.
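
A direct transcription of Algorithm 2 into Python may help in following the correctness argument. The sketch below makes several assumptions that are not in the paper: paths are tuples of labels, a label ending in "#" is treated as an attribute name and "S" as text (following the naming pattern of the examples), and Att(q) is restricted to paths that occur as prefixes of the paths in the input, since the full set of attribute names is infinite. The loop rescans the unused XFDs rather than choosing one arbitrarily, so this version does not have the linear running time claimed for the algorithm.

```python
ROOT = ("root",)


def path(s):
    return tuple(s.split("."))


def is_prefix(p, q):
    return q[:len(p)] == p


def intersect(p, q):
    n = 0
    while n < min(len(p), len(q)) and p[n] == q[n]:
        n += 1
    return p[:n]


def is_attribute(label):
    return label.endswith("#")          # assumption: attribute names end in '#'


def is_element(label):
    return not is_attribute(label) and label != "S"


def anc(p):
    """Anc(p): the strict prefixes of p."""
    return {p[:i] for i in range(1, len(p))}


def att(p, universe):
    """Att(p), restricted to the known paths in `universe`."""
    return {x for x in universe if x[:-1] == p and is_attribute(x[-1])}


def closure(xfds, p):
    """Compute P+ for the path p and a list of unary XFDs given as (r, s) pairs."""
    paths = [p] + [x for xfd in xfds for x in xfd]
    universe = {x[:i] for x in paths for i in range(1, len(x) + 1)}
    closed = {p, ROOT}                                    # step 1
    if is_element(p[-1]):
        closed |= anc(p)                                  # step 2
    for q in list(closed):
        closed |= att(q, universe)                        # step 3
    unused = list(xfds)
    changed = True
    while changed:
        changed = False
        for r, s in list(unused):
            rs = intersect(r, s)
            step4 = any(is_prefix(rs, p1) and (is_prefix(p1, r) or is_prefix(p1, s))
                        for p1 in closed)
            if step4 or r in closed or rs == ROOT:        # steps 4, 5, 6
                closed.add(s)
                unused.remove((r, s))
                if is_element(s[-1]):
                    closed |= anc(s) | att(s, universe)   # step 7
                changed = True
    return closed


sigma = [(path("root.A.B.B#"), path("root.A.A#")),
         (path("root.A.C.C#"), path("root.A.B")),
         (path("root.A.A#"), path("root.D.D#"))]
for q in sorted(closure(sigma, path("root.A.B.B#"))):
    print(".".join(q))
```

For this input the second XFD is never applied, because no path in the closure satisfies any of the three conditions for root.A.C.C# -> root.A.B.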

Case 1: $Last(p) \notin \mathbf{E}$

First construct a tree $T_0$ with the following properties. $T_0$ is complete w.r.t. $P_\Sigma$ and for every path
appearing in $P_\Sigma$, except for the root, there are exactly two path instances of the path in $T_0$. Also, the
path instances for $p$ have the property that the val of the final nodes in the path instances are the same
whereas the val of the end nodes for the path instances of any other path in $P_\Sigma$ are distinct. Such a tree
can always be constructed. We now illustrate the construction by an example.

Example 4 Let $\Sigma$ = {root.A.B.B# $\rightarrow$ root.A.A#, root.A.C.C# $\rightarrow$ root.A.B, root.A.A# $\rightarrow$ root.D.D#}
and let $p$ be the path root.A.B.B#. Then the tree $T_0$ is shown in Figure 9.

The next step is, using as input the set of XFDs returned in $P^+$ by Algorithm 2 and the tree $T_0$, to generate
the tree $\bar{T}_0$ using Algorithm 1. We note that it follows from Lemma 4 that $\bar{T}_0$ satisfies $P^+$. We now
prove some preliminary lemmas before establishing the main result.

Lemma 5 Let $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ be two distinct path instances in $Paths(q)$ in $T_0$ for any path $q$
in $P_\Sigma$. Then the only common node to both path instances is root.

Figure 9: A XML tree

Proof. Suppose to the contrary that node $v_j$ is common to both path instances. Then $v_j \in N(s)$ for
some path $s$ that is a prefix of $q$. So because of the definition of $T_0$ there must exist two path instances
in $Paths(s)$ and so there exists another node $v''_j$ in $N(s)$ that is distinct from $v_j$. There are then two
possibilities. The first is that there exists another path instance $v''_1.\cdots.v''_n$ in $Paths(q)$ that contains
$v''_j$. If this is the case then since $v_j$ and $v''_j$ are distinct the paths $v_1.\cdots.v_n$, $v'_1.\cdots.v'_n$ and $v''_1.\cdots.v''_n$ are
distinct, which contradicts the fact that $T_0$ has only two path instances for any path. The other possibility
is that there is no path instance in $Paths(q)$ in $T_0$ that contains $v''_j$ but this contradicts the fact that $T_0$
is complete w.r.t. $P_\Sigma$. So either possibility leads to a contradiction so we conclude that the only common
node to $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ is root. □



Lemma 6 Let $q$ be any path in $P_\Sigma$. Then if there exist two distinct path instances $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$
in $Paths(q)$ in $\bar{T}_0$ such that $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ have a common node that is not the root then there
exists $s'$ such that $s' \in Anc(q)$ and $s' \in P^+$.

Proof. We prove the result by induction on the number of steps in constructing $\bar{T}_0$. Initially the
result is true for $T_0$ by Lemma 5. Assume inductively then that it is true after iteration $k - 1$. The
only way that we can have that $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in $Paths(q)$ have a common non-root node after
iteration $k$ is if we merge two ancestor nodes of $v_n$ and $v'_n$. For this to happen we have to have $s$ such
that $s \in Anc(q)$ and $s \in P^+$. □



Lemma 7 If a XFD $r \to s$ is violated in $\bar{T}_0$ then $p \to s$ is violated in $\bar{T}_0$.

Proof. If $r \to s$ is violated then there exist distinct path instances $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in
$Paths(s)$ such that $val(v_n) \neq val(v'_n)$. However by the construction of $\bar{T}_0$, $N(p)$ contains only two
nodes, say $v_1$ and $v_2$, such that $val(v_1) = val(v_2)$. Let us then compute $x_1 = \{v \mid v \in \{v_1, \ldots, v_n\} \wedge v \in
N(p \cap s)\}$ and $y_1 = \{v \mid v \in \{v'_1, \ldots, v'_n\} \wedge v \in N(p \cap s)\}$. If $x_1 = y_1$ then $Nodes(x_1, p) = Nodes(y_1, p)$
and so $Nodes(x_1, p) \cap Nodes(y_1, p) \neq \emptyset$ and so $p \to s$ is violated. If $x_1 \neq y_1$ then we must have
that $val(Nodes(x_1, p)) \cap val(Nodes(y_1, p)) \neq \emptyset$ because $v_1 \in Nodes(x_1, p)$ and $v_2 \in Nodes(y_1, p)$ and
$val(v_1) = val(v_2)$ and so $p \to s$ is again violated. □



Lemma 8 If there is a path $q$ in $P_\Sigma$ such that $Last(q) \notin \mathbf{E}$ then $q \in p^+$ iff there exist two distinct nodes
$v_1$ and $v_2$ in $N(q)$ in $\bar{T}_0$ such that $val(v_1) = val(v_2)$.

Proof.

If: We prove the result again by induction on the number of steps in constructing $\bar{T}_0$. Initially the
result is true since $p \in p^+$ and the val of the two nodes in $N(p)$ is the same. Suppose then that there
exist two nodes in $N(q)$ such that $val(v_1) \neq val(v_2)$ before step $k$ and $val(v_1) = val(v_2)$ after step $k$.
By the definition of Algorithm 1 the only way for this to happen is if $q \in p^+$.

Only If: We shall show the contrapositive that if there exist two nodes $v_1$ and $v_2$ in $N(q)$ such that
$val(v_1) \neq val(v_2)$ then $q \notin p^+$. If there exist two nodes $v_1$ and $v_2$ in $N(q)$ such that $val(v_1) \neq val(v_2)$
then using the same reasoning as in Lemma 7 it follows that $\bar{T}_0$ violates $p \to q$ and so by Lemma 2 we
must have that $q \notin p^+$. □



Lemma 9 If there is a path $q$ in $P_\Sigma$ such that $Last(q) \in \mathbf{E}$ then $q \in p^+$ iff $|N(q)| = 1$ in $\bar{T}_0$.

Proof.

If: We prove the result by induction on the number of steps in generating $\bar{T}_0$. Initially the result is
true since $p$ is the only node in $p^+$ and $Last(p) \notin \mathbf{E}$. Suppose then it is true after iteration $k$. Then by
definition of the algorithm, the only case where we can have a path $q$ such that $|N(q)| \neq 1$ after step $k$
but $|N(q)| = 1$ after step $k + 1$ is if $q \in p^+$.

Only If: Suppose $|N(q)| \neq 1$. By the construction of $\bar{T}_0$ it can easily be seen that $N(p)$ contains only
two nodes and the val of the nodes are equal. Thus using the same arguments as used in Lemma 7 it follows that
$p \to q$ is violated in $\bar{T}_0$, which is a contradiction since $q \in p^+$ and by Lemma 2 $\bar{T}_0$ satisfies $p \to q$. Hence
we conclude that $|N(q)| = 1$. □



Theorem 3 Algorithm 2 correctly computes $p^+$ when $Last(p) \notin \mathbf{E}$.

Proof. We firstly show that if $q \in p^+$ then $p \to q$ is in $\Sigma^+$. We show this by induction on the
number of iterations in computing $p^+$. At line 1 $p^+$ contains $p$ and root and $p \in p^+$ by axiom A1 and
$root \in p^+$ by axiom A8. At line 2 the result follows by axiom A6 and at line 3 by axiom A7. Hence the
inductive hypothesis is true at the commencement of the loop. Let $p^j$ denote the computation of $p^+$
after iteration $j$. Assume then that the hypothesis is true after iteration $j - 1$. If $q$ is added to $p^j$
because of line 4 then $p \to q$ is in $\Sigma^+$ because of axiom A5 and the induction hypothesis. If $q$ is added at
line 5 then $r \in p^+$ by the induction hypothesis and axiom A3. If $q$ is added because of line 6 then $q \in p^+$
by axiom A4. If $q$ is added as a result of line 7 then $q \in p^+$ because of axioms A6 and A7.

Next we show that if $p \to q \in \Sigma^+$ then $q \in p^+$. We firstly claim that $\bar{T}_0$ satisfies $\Sigma$ (note that this
does not follow from Lemma 2 since we are using $p^+$ as input to Algorithm 1 rather than $\Sigma$). Let $r \to s$
be any XFD in $\Sigma$. Suppose firstly that $r = root$. If $r \to s$ is violated in $\bar{T}_0$ then by Lemma 7 $p \to s$ must
be violated. However since $root \to s \in \Sigma$, we must have that $s \in p^+$ or else $s$ could be added at line
6, contradicting the definition of $p^+$. So by Lemma 2 $p \to s$ is satisfied in $\bar{T}_0$, hence we conclude that
$r \to s$ is satisfied in $\bar{T}_0$ or else by Lemma 7 $p \to s$ is violated, which is a contradiction. Suppose then that
$r \to s$ is violated in $\bar{T}_0$ and $r \neq root$. The first way for this to happen is if there exist two path instances
$v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in $Paths(s)$ such that $x_1 \neq y_1$, where $x_1 = \{v \mid v \in \{v_1, \ldots, v_n\} \wedge v \in N(r \cap s)\}$
and $y_1 = \{v \mid v \in \{v'_1, \ldots, v'_n\} \wedge v \in N(r \cap s)\}$, and there exist $v_1 \in Nodes(x_1, r)$ and $v_2 \in Nodes(y_1, r)$
such that $val(v_1) = val(v_2)$. For this to happen it follows from Lemma 8 that $r \in p^+$. We must also
have that $s \in p^+$ or else $s$ could be added to $p^+$ by line 5, thus contradicting the definition of $p^+$.
However, if $s \in p^+$ then by Lemma 2 $p \to s$ is satisfied in $\bar{T}_0$, which contradicts the assumption that
$r \to s$ is violated in $\bar{T}_0$ by Lemma 7. We conclude that in this case $r \to s$ is satisfied. The second
way that $r \to s$ could be violated in $\bar{T}_0$ is if there exist two path instances $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in
$Paths(s)$ such that $x_1 = y_1$. If $x_1 = root$ then $r \cap s = root$ and so $s \in p^+$ or else it could be added at
line 6, contradicting the definition of $p^+$. If $r \to s$ is violated, then by Lemma 7 $p \to s$ is violated, which
contradicts Lemma 2, and so we conclude that $r \to s$ is satisfied. Assume then that $x_1 \neq root$ and so by
Lemma 6 and since $x_1 = y_1$ there exists $s'$ such that $s' \in Anc(s)$ and $s' \in p^+$. It follows that $r \cap s$ is a
prefix of $s'$ and so $r \cap s \in Anc(s)$ and thus $r \cap s \in p^+$ or else it could be added at line 7 of Algorithm 2,
which contradicts the definition of $p^+$. Next, since $r \cap s \in p^+$ it follows that $s$ must be in $p^+$ or else
it could be added at line 4 since $r \cap s$ is a prefix of $r$, which contradicts the definition of $p^+$. However,
by Lemma 2 $\bar{T}_0$ satisfies $p \to s$ and if $r \to s$ is violated in $\bar{T}_0$ then $p \to s$ is violated in $\bar{T}_0$ by Lemma 7,
which is a contradiction and so $r \to s$ must be satisfied in $\bar{T}_0$.

To complete the proof suppose that $p \to q \in \Sigma^+$. Since $\bar{T}_0$ satisfies $\Sigma$, $\bar{T}_0$ also satisfies $p \to q$.
Suppose firstly that $Last(q) \in \mathbf{E}$. If $|N(q)| \neq 1$ then using a similar argument to Lemma 7 this would
imply that $p \to q$ is violated in $\bar{T}_0$, which is a contradiction and so $|N(q)| = 1$. Then by Lemma 9
$q \in p^+$. Suppose instead that $Last(q) \notin \mathbf{E}$. By definition of Algorithm 1, there exist two nodes
in $N(q)$, say $v_1$ and $v_2$, since $Last(q) \notin \mathbf{E}$ and non-element nodes are not removed in the Algorithm.
If $val(v_1) \neq val(v_2)$ then using similar arguments to those used in Lemma 7 it follows that $\bar{T}_0$ violates
$p \to q$, which is a contradiction. Thus we must have that $val(v_1) = val(v_2)$ and so by Lemma 8 $q \in p^+$.
This completes the proof. □

Case 2: $Last(p) \in \mathbf{E}$

First construct a tree $T_1$ with the following properties. $T_1$ is complete w.r.t. $P_\Sigma$. For $p$ and every
path $q$ such that $q$ is a prefix of $p$ or $q$ is an attribute node and its parent is a prefix of $p$, $T_1$ contains
exactly one path instance. For every other path in $P_\Sigma$, $T_1$ contains exactly two path instances. Also, the
val of any node in $T_1$ is distinct. Such a tree always exists. The construction procedure is illustrated by
the following example.

Example 5 Let $\Sigma = \{root.A.B.B\# \to root.A.B.C.C\#,\ root.A.B.C.C\# \to root.A.A\#,\ root.A.D.D\# \to
root.E.E\#\}$ and let $p$ be the path $root.A.B$. Then the tree $T_1$ is shown in Figure 10.

Figure 10: A XML tree

The next step is to generate the tree $\bar{T}_1$ using Algorithm 1, taking as input the set of XFDs returned in
$p^+$ by Algorithm 2 and the tree $T_1$. We note that it follows from Lemma 2 that $\bar{T}_1$ satisfies $p^+$. We
now prove some preliminary lemmas before establishing the main result.

Lemma 10 Let $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ be two distinct path instances in $Paths(q)$ in $T_1$ for any path
$q$ appearing in $P_\Sigma$. Then $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ have a common node that is not the root only if
$q \cap p \neq root$.

Proof. Suppose to the contrary that $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ have a common node $v_j$ that is not the
root and $q \cap p = root$. Then $v_j \in N(s)$ for some path $s$ such that $s$ is a prefix of $q$. So because of the
definition of $T_1$ there must exist two path instances in $Paths(s)$ and so there exists another node in
$N(s)$ that is distinct from $v_j$. Then since $q \cap p = root$, it follows that $s$ is not a prefix of $p$. Hence there
must be two path instances of $s$ in $T_1$ and using the same reasoning as in Lemma 5 shows that this leads
to a contradiction. □

Lemma 11 Let $q$ be any path in $P_\Sigma$. Then if there exist two distinct path instances $v_1.\cdots.v_n$ and
$v'_1.\cdots.v'_n$ in $Paths(q)$ in $\bar{T}_1$ such that $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ have a common node that is not the root
then either (there exists $s'$ such that $s' \in Anc(q)$ and $s' \in p^+$) or $q \cap p \neq root$.

Proof. We prove the result by induction on the number of steps in constructing $\bar{T}_1$. Initially the
result is true for $T_1$ by Lemma 10. Assume inductively then that it is true after iteration $k - 1$. The
only way that we can have that $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in $Paths(q)$ have a common non-root node after
iteration $k$ is if we merge two ancestor nodes of $v_n$ and $v'_n$. For this to happen, by definition of Algorithm
1 we must have $s$ such that $s \in Anc(q)$ and $s \in p^+$. □

Lemma 12 If a XFD $r \to s$ is violated in $\bar{T}_1$ then $p \to s$ is violated in $\bar{T}_1$.

Proof. If $r \to s$ is violated then there exist distinct paths $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in $Paths(s)$ such
that $val(v_n) \neq val(v'_n)$. However by the construction of $\bar{T}_1$, $N(p)$ contains only one node and so if we
compute $x_1 = \{v \mid v \in \{v_1, \ldots, v_n\} \wedge v \in N(p \cap s)\}$ and $y_1 = \{v \mid v \in \{v'_1, \ldots, v'_n\} \wedge v \in N(p \cap s)\}$, then
$x_1 = y_1$ since $\bar{T}_1$ is a tree and so by definition of a XFD $p \to s$ is violated. □

Lemma 13 If there is a path $q$ in $P_\Sigma$ such that $Last(q) \notin \mathbf{E}$ then $q \in p^+$ iff there exist two distinct
nodes $v_1$ and $v_2$ in $N(q)$ in $\bar{T}_1$ such that $val(v_1) = val(v_2)$.

Proof.

If: We prove the result again by induction on the number of steps in constructing $\bar{T}_1$. Initially the
result is true since there is no path $q$ and two nodes $v_1$ and $v_2$ in $N(q)$ in $T_1$ such that $val(v_1) = val(v_2)$.
Assume then it is true after iteration $k - 1$. Suppose then that there exist two nodes in $N(q)$ such
that $val(v_1) \neq val(v_2)$ before step $k$ and $val(v_1) = val(v_2)$ after step $k$. By the definition of Algorithm 1
the only way for this to happen is if $q \in p^+$.

Only If: We shall show the contrapositive that if there exist two nodes $v_1$ and $v_2$ in $N(q)$ such that
$val(v_1) \neq val(v_2)$ then $q \notin p^+$. If there exist two nodes $v_1$ and $v_2$ in $N(q)$ such that $val(v_1) \neq val(v_2)$
then using the same reasoning as in Lemma 12 it follows that $\bar{T}_1$ violates $p \to q$ and so by Lemma 2 we
must have that $q \notin p^+$. □

Lemma 14 If there is a path $q$ in $P_\Sigma$ such that $Last(q) \in \mathbf{E}$ then $q \in p^+$ iff $|N(q)| = 1$ in $\bar{T}_1$.

Proof.

If: We prove the result by induction on the number of steps in generating $\bar{T}_1$. Initially, by definition
of $T_1$, if $|N(q)| = 1$ then either $q = p$ or $q$ is a prefix of $p$ or $Last(q) \in \mathbf{A}$ and $Parnt(q)$ is a prefix
of $p$. If $q = p$ then $q \in p^+$ by line 1. If $q$ is a prefix of $p$ then $q \in Anc(p)$ and so $q \in p^+$ by line 2. If
$Last(q) \in \mathbf{A}$ and $Parnt(q)$ is a prefix of $p$ then $q \in p^+$ by line 3. Hence at the start of the repeat
loop the result is true. Suppose then it is true after iteration $k$. Then by definition of Algorithm 1,
the only case where we can have a path $q$ such that $|N(q)| \neq 1$ after step $k$ but $|N(q)| = 1$ after step
$k + 1$ is if $q \in p^+$.

Only If: Suppose to the contrary that $|N(q)| \neq 1$. By the construction of $\bar{T}_1$ it can easily be seen
that $N(p)$ contains only one node. Hence if we define $x_1 = \{v \mid v \in \{v_1, \ldots, v_n\} \wedge v \in N(p \cap q)\}$ and
$y_1 = \{v \mid v \in \{v'_1, \ldots, v'_n\} \wedge v \in N(p \cap q)\}$ then $x_1 = y_1$ since $|N(p)| = 1$ and so by the definition of a XFD
$p \to q$ is violated in $\bar{T}_1$. This is a contradiction since $q \in p^+$ and by Lemma 2 $\bar{T}_1$ satisfies $p \to q$ and
so we conclude that $|N(q)| = 1$. □

Lemma 15 Let $r \to s$ be a XFD in $\Sigma$, let $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ be path instances in $Paths(s)$ in $\bar{T}_1$,
and let $x_1 = \{v \mid v \in \{v_1, \ldots, v_n\} \wedge v \in N(r \cap s)\}$ and $y_1 = \{v \mid v \in \{v'_1, \ldots, v'_n\} \wedge v \in N(r \cap s)\}$. Then if
$p \cap s$ is a strict prefix of $r \cap s$ and $p \to r \notin \Sigma$, and $p \to s \notin \Sigma$ then $x_1 \neq y_1$.

Proof. The claim of the lemma can best be illustrated by a diagram. Let $s_1$ denote the path instance
$v_1.\cdots.v_n$ and let $s_2$ denote the path instance $v'_1.\cdots.v'_n$. Then the claim of the lemma is that
only the situation illustrated in (b) of Figure 11 can arise, and not the situation illustrated in (a) of
Figure 11. We prove the result by induction on the number of steps to generate $\bar{T}_1$. Firstly we claim
that $T_1$ cannot have the structure illustrated in (a) of Figure 11. Suppose that it has. Then since by
definition of $T_1$ there have to be two path instances for every path and $x_1 = y_1$, there must be another
distinct node in $N(r \cap s)$. However, using the same argument as in Lemma 5 shows that we then
contradict the fact that either there are exactly two path instances for any path in $T_1$ or we contradict
the fact that $T_1$ is complete. Hence $T_1$ must have the structure shown in (b) of Figure 11. Assume
inductively then that the property holds after iteration $k - 1$ of Algorithm 1. The only way that (a) of
Figure 11 could possibly arise is if we merged path instances of $r$ or path instances of $s$ but this cannot
occur because of the definition of Algorithm 1 and the assumptions that $p \to r \notin \Sigma$ and $p \to s \notin \Sigma$. □

Figure 11: A XML tree illustrating Lemma 15

Theorem 4 Algorithm 2 correctly computes $p^+$ when $Last(p) \in \mathbf{E}$.

Proof. The proof that if $q \in p^+$ then $p \to q$ is in $\Sigma^+$ is the same as for Theorem 3.
Next we show that if $p \to q \in \Sigma^+$ then $q \in p^+$. We firstly claim that $\bar{T}_1$ satisfies $\Sigma$. Let $r \to s$
be any XFD in $\Sigma$. If $r = root$ then it follows from Lemma 12 and Lemma 2 that $r \to s$ is satisfied
in $\bar{T}_1$. Suppose then that $r \to s$ is violated in $\bar{T}_1$ and $r \neq root$. The first way for this to happen is if
there exist two path instances $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in $Paths(s)$ such that $x_1 \neq y_1$, where $x_1 = \{v \mid v \in
\{v_1, \ldots, v_n\} \wedge v \in N(r \cap s)\}$ and $y_1 = \{v \mid v \in \{v'_1, \ldots, v'_n\} \wedge v \in N(r \cap s)\}$, and there exist $v_1 \in Nodes(x_1, p)$
and $v_2 \in Nodes(y_1, p)$ such that $val(v_1) = val(v_2)$. Also, since $x_1 \neq y_1$, $v_1$ and $v_2$ are distinct. For this
to happen it follows from Lemma 13 that $r \in p^+$. We must also have that $s \in p^+$ or else $s$ could be
added to $p^+$ at line 5, thus contradicting the definition of $p^+$. However, if $s \in p^+$ then by Lemma 2
$p \to s$ is satisfied in $\bar{T}_1$, which contradicts the assumption that $r \to s$ is violated in $\bar{T}_1$ by Lemma 12.
We conclude that $r \to s$ is satisfied. The second way that $r \to s$ could be violated in $\bar{T}_1$ is if there exist
two path instances $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in $Paths(s)$ such that $x_1 = y_1$. If $x_1 = root$ then the same
arguments as used in Theorem 3 show that $r \to s$ is satisfied. If $x_1 \neq root$ then by Lemma 11 there
either exists $s'$ such that $s' \in Anc(s)$ and $s' \in p^+$ or $s \cap p \neq root$. Consider the first possibility. Using
the same arguments as in Theorem 3 shows that $r \to s$ is satisfied. Consider then the second situation
where $s \cap p \neq root$. There are two cases to consider: (a) $r \cap s = root$ and (b) $r \cap s \neq root$. Consider (a).
Since $r \to s \in \Sigma$ and $r \cap s = root$ then $s \in p^+$ or else it could be added at line 6, thus contradicting the
definition of $p^+$. So if $r \to s$ is violated in $\bar{T}_1$ then by Lemma 12 $p \to s$ is violated, which contradicts
Lemma 2, and so we conclude that $r \to s$ is satisfied in $\bar{T}_1$. Consider (b). We now consider the subcases:
(b.1) $p \cap s$ is a strict prefix of $r \cap s$ and (b.2) $p \cap s$ is not a strict prefix of $r \cap s$. Consider (b.1). Suppose
$p \to s \in \Sigma$. Then $s \in p^+$ or else it could be added at line 5, which contradicts the definition of $p^+$. So
by Lemma 2 $p \to s$ is satisfied in $\bar{T}_1$. Suppose then that $p \to r \in \Sigma$. Then $r \in p^+$ or else it could be
added at line 5, which contradicts the definition of $p^+$, and so $s \in p^+$ or else it could be added at line
5, which contradicts the definition of $p^+$. Assume then that $p \to s \notin \Sigma$ and that $p \to r \notin \Sigma$. Then this
case cannot arise because by Lemma 15 this would imply that $x_1 \neq y_1$, which contradicts the assumption
that $x_1 = y_1$. Consider (b.2). Since $p \cap s$ is not a strict prefix of $r \cap s$ then $r \cap s$ is a prefix of $p$ and so
we must have that $r \cap s \in p^+$ or else it could be added at line 2, which contradicts the definition of $p^+$.
Then since $r \cap s \in p^+$ it follows that $s \in p^+$ or else it could be added at line 4 since $r \cap s$ is a prefix of
$r \cap s$ and $r \cap s$ is a prefix of $r$ and $s$. Hence using the same arguments as previously it follows that $r \to s$
is satisfied.

To complete the proof suppose that $p \to q \in \Sigma^+$. Since $\bar{T}_1$ satisfies $\Sigma$, $\bar{T}_1$ also satisfies $p \to q$.
Suppose firstly that $Last(q) \in \mathbf{E}$. If $|N(q)| \neq 1$ then using a similar argument to Lemma 12 this would
imply that $p \to q$ is violated in $\bar{T}_1$, which is a contradiction and so $|N(q)| = 1$. Then by Lemma 14
$q \in p^+$. Suppose instead that $Last(q) \notin \mathbf{E}$. By definition of Algorithm 1, there exist two nodes
in $N(q)$, say $v_1$ and $v_2$, since $Last(q) \notin \mathbf{E}$ and non-element nodes are not removed in Algorithm 1. If
$val(v_1) \neq val(v_2)$ then using similar arguments to those used in Lemma 12 it follows that $\bar{T}_1$ violates
$p \to q$, which is a contradiction. Thus we must have that $val(v_1) = val(v_2)$ and so by Lemma 13 $q \in p^+$.
This completes the proof. □

6 Conclusions



In this paper we have investigated issues related to the functional dependencies in XML. Such constraints 
are important because of the close relationship between XML and relational databases and also because 
of the importance of functional dependencies in developing a theory of normalization. In an associated 
paper j^l] we defined functional dependencies in XML (XFDs) and provided a set of axioms for reasoning 
about XFD implication. In this paper we have proven that the axioms are also complete for
unary XFDs. The second contribution of the paper has been to prove that the implication problem for 
unary XFDs is decidable and to provide a linear time algorithm for it. These results have considerable 
significance in the development of a theory of normalization for XML documents. In relational databases, 
the classic results on soundness and completeness of Armstrong's axioms [4] and the resulting closure
algorithm for FD implication play an essential role in determining whether a relation is in one of the 
classic normal forms. Similarly, the results in this paper are an important first step in the development 
of algorithms for testing the normal form proposed in [22].
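
For comparison, the relational closure algorithm referred to above can be sketched as follows. This is the standard textbook attribute-closure procedure for relational FDs and is not Algorithm 2 of this paper; the function names `attribute_closure` and `implies` and the example dependencies are assumptions made purely for illustration.

```python
def attribute_closure(attrs, fds):
    """Return the closure of a set of attributes under a list of relational FDs.

    fds is a list of (lhs, rhs) pairs, where lhs and rhs are sets of attribute names.
    """
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire an FD when its left-hand side is covered and it adds something new.
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure


def implies(fds, lhs, rhs):
    """Decide whether fds imply lhs -> rhs by testing rhs against the closure of lhs."""
    return set(rhs) <= attribute_closure(lhs, fds)


# Hypothetical example dependencies: A -> B, B -> C, CD -> E.
example_fds = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"C", "D"}, {"E"})]
closure_of_a = attribute_closure({"A"}, example_fds)
```

In this sketch an FD is implied exactly when its right-hand side lies inside the closure of its left-hand side, which is the role played in this paper by the test $q \in p^+$ for unary XFDs.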

There are several other issues related to the one investigated in this paper that we intend to investigate 
in the future. The main results of this paper have only been established for unary XFDs and there is 
a need to extend the results to arbitrary XFDs. Secondly, the approach adopted in this paper is based 
on the strong satisfaction approach to XFD satisfaction but the techniques we have used can also be 
extended to defining weak satisfaction. In this case there is a need to develop a complete and sound
axiom system for implication of weak XFDs as well as determining if the implication of weak satisfaction 
is decidable and if so to develop an efficient algorithm for it. Thirdly, there is a need to investigate 
the extension of the other important class of constraints in relational databases, namely multivalued 
dependencies (MVDs) [13], to XML. We have already completed some preliminary work on this problem
PT] and in particular we have defined MVDs in XML and proposed a 4NF for XML and shown that it 
eliminates redundancy. However, important issues such as axiom systems for MVDs and the interaction 
between XFDs and MVDs in XML have yet to be investigated.

7 Appendix

Proof of Theorem 1

Axiom A1 is immediate from the definition of a XFD.

Consider A2. Suppose that there exist two distinct path instances $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in $Paths(q)$
satisfying the conditions in Definition 14. Then by Definition 14 we must have that $\exists i$, $1 \leq i \leq k$, such
that $x_i \neq y_i$ if $Last(p_i) \in \mathbf{E}$ else $\exists i$, $1 \leq i \leq k$, such that $\bot \notin Nodes(x_i, p_i)$ and $\bot \notin Nodes(y_i, p_i)$ and
$val(Nodes(x_i, p_i)) \cap val(Nodes(y_i, p_i)) = \emptyset$. This condition will still hold for the XFD $p, p_1, \ldots, p_k \to q$
and so A2 is sound.

Consider A3. Suppose that there exist two distinct path instances $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in $Paths(s)$.
Then since $q \to s$ is satisfied, we must have that $x'_1 \neq y'_1$ where $x'_1 = \{v \mid v \in \{v_1, \ldots, v_n\} \wedge v \in
N(q \cap s)\}$ and $y'_1 = \{v \mid v \in \{v'_1, \ldots, v'_n\} \wedge v \in N(q \cap s)\}$. Then there must exist path instances $t_1.\cdots.t_n$
and $t'_1.\cdots.t'_n$ in $Paths(q)$ such that $x'_1$ is in $t_1.\cdots.t_n$ and $y'_1$ is in $t'_1.\cdots.t'_n$. Since $x'_1 \neq y'_1$ the two
path instances must be distinct. Also, since $q \to s$ is satisfied we must have that $val(t_n) \neq val(t'_n)$.
So by the definition of a XFD and since $p_1, \ldots, p_k \to q$ is satisfied we must have that $\exists i$, $1 \leq i \leq k$,
such that $x_i \neq y_i$ (where $x_i$ and $y_i$ are defined as in Definition 14) if $Last(p_i) \in \mathbf{E}$ else $\exists i$, $1 \leq i \leq k$,
such that $\bot \notin Nodes(x_i, p_i)$ and $\bot \notin Nodes(y_i, p_i)$ and $val(Nodes(x_i, p_i)) \cap val(Nodes(y_i, p_i)) = \emptyset$ and
so $p_1, \ldots, p_k \to s$ is satisfied.

Consider A4. Suppose that there exist two distinct path instances $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in $Paths(q)$
satisfying the conditions $((v_n = \bot \wedge v'_n = \bot) \vee (v_n \neq \bot \wedge v'_n = \bot) \vee (v_n = \bot \wedge v'_n \neq \bot) \vee (v_n \neq \bot \wedge v'_n \neq \bot
\wedge val(v_n) \neq val(v'_n)))$. Then since $p_i \cap q = root$, it follows that $x_i = y_i = root$ for all $i$, $1 \leq i \leq k$,
and so $p_1, \ldots, p_k \to q$ is violated, which is a contradiction. Hence there cannot exist distinct path
instances $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in $Paths(q)$ satisfying the conditions $((v_n = \bot \wedge v'_n = \bot) \vee (v_n \neq \bot
\wedge v'_n = \bot) \vee (v_n = \bot \wedge v'_n \neq \bot) \vee (v_n \neq \bot \wedge v'_n \neq \bot \wedge val(v_n) \neq val(v'_n)))$ and so any XFD $p'_1, \ldots, p'_j \to q$ is
automatically satisfied.

Consider A5. Suppose firstly that $p \cap q$ is a prefix of $p'$ and $p'$ is a prefix of $p$ and that $p' \to q$ is violated.
Then since $p'$ is a prefix of $p$ we can ignore the case where $p' = p$ and assume that $Last(p') \in \mathbf{E}$. So
for $p' \to q$ to be violated we must have that $x'_1 = y'_1$ where $x'_1 = \{v \mid v \in \{v_1, \ldots, v_n\} \wedge v \in N(p' \cap q)\}$
and $y'_1 = \{v \mid v \in \{v'_1, \ldots, v'_n\} \wedge v \in N(p' \cap q)\}$. However since $p'$ is a prefix of $p$ and $p \cap q$ is a prefix of
$p'$, it follows that $x_1 = x'_1$ and $y_1 = y'_1$ where $x_1 = \{v \mid v \in \{v_1, \ldots, v_n\} \wedge v \in N(p \cap q)\}$ and $y_1 = \{v \mid v \in
\{v'_1, \ldots, v'_n\} \wedge v \in N(p \cap q)\}$. Thus it follows that $x_1 = y_1$ and so $Nodes(x_1, p) = Nodes(y_1, p)$ and so
$p \to q$ is violated, which is a contradiction, and so $p' \to q$ is satisfied. Next suppose that $p \cap q$ is a prefix of
$p'$ and $p'$ is a prefix of $q$ and that $p' \to q$ is violated. Then since $p'$ is a prefix of $q$ we can ignore the case
where $p' = q$ and assume that $Last(p') \in \mathbf{E}$. So for $p' \to q$ to be violated we must have that $x'_1 = y'_1$
where $x'_1 = \{v \mid v \in \{v_1, \ldots, v_n\} \wedge v \in N(p' \cap q)\}$ and $y'_1 = \{v \mid v \in \{v'_1, \ldots, v'_n\} \wedge v \in N(p' \cap q)\}$. Then
since $p \cap q$ is a prefix of $p'$ this implies that $x_1 = y_1$, which implies a contradiction as before.

Consider A6. Suppose that there exist two distinct path instances $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in $Paths(q)$
such that $val(v_n) \neq val(v'_n)$. Then because $q$ is a prefix of $p$, $p \cap q = q$ and so $x_1 = v_n$ and $y_1 = v'_n$. Thus
$p \to q$ is satisfied since $val(v_n) \neq val(v'_n)$.

Consider A7. Suppose that there exist two distinct path instances $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in $Paths(q)$
such that $val(v_n) \neq val(v'_n)$. Then because $Last(q) \in \mathbf{A}$, $Parent(v_n) \neq Parent(v'_n)$. Also, by definition
of $x_1$ and $y_1$, $x_1 = Parent(v_n)$ and $y_1 = Parent(v'_n)$ and thus $x_1 \neq y_1$ and so $Parnt(q) \to q$ is satisfied
since $Last(Parnt(q)) \in \mathbf{E}$.

Axiom A8 is automatic since there is only one path instance that ends with $v_r$. □
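
The soundness argument for A6 relies on the fact that $p \cap q = q$ whenever $q$ is a prefix of $p$. As a minimal sketch, assuming that paths are written as dot-separated label strings and that $p \cap q$ denotes the longest common prefix of the two paths, the prefix test and the intersection could be written as follows. The helper names `labels`, `is_prefix` and `intersect` are hypothetical and are introduced only for this illustration.

```python
def labels(path):
    """Split a dot-separated path such as 'root.A.B' into its list of labels."""
    return path.split(".")


def is_prefix(q, p):
    """Return True if path q is a (not necessarily strict) prefix of path p."""
    q_labels, p_labels = labels(q), labels(p)
    return len(q_labels) <= len(p_labels) and p_labels[:len(q_labels)] == q_labels


def intersect(p, q):
    """Return the longest common prefix of p and q (the assumed meaning of p ∩ q)."""
    common = []
    for a, b in zip(labels(p), labels(q)):
        if a != b:
            break
        common.append(a)
    return ".".join(common)


# Paths taken from Example 4; every path starts at root.
p = "root.A.B.B#"
q = "root.A.B"
q_is_prefix = is_prefix(q, p)
p_cap_q = intersect(p, q)
```

Under these assumptions `intersect(p, q)` returns `q` itself when `q` is a prefix of `p`, and two paths that share only the root intersect in `root`.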

Proof of Theorem [H 

Let $\Sigma$ be a set of XFDs and let $\Sigma^+$ be the set of XFDs obtained by using Axioms A1 - A8. Let $p \to q$
be a XFD that is not in $\Sigma^+$. Then to show completeness it suffices to show that there exists a tree $T$
that satisfies $\Sigma$ but not $p \to q$. We consider several cases.

Case A: $Last(p) \in \mathbf{E}$

We now consider several subcases. The only cases that can arise are: (a) $p > q$; (b) $q > p$; (c) $p \not> q$
and $q \not> p$. We firstly note that because of Axiom A6 case (a) cannot arise so the only cases to consider
are (b) and (c). We consider (c) first.

Case AA: $p \not> q$ and $q \not> p$

Let $\{p_1 \to q, \ldots, p_n \to q\}$ be the set of all XFDs in $\Sigma^+$ which have $q$ on the r.h.s. (we note that $\Sigma^+$
can be computed using Algorithm 2 in Section 5). Consider the paths $\{p_1 \cap q, \ldots, p_n \cap q\}$. Since each
of these paths is a prefix of $q$ we can order the set $\{p_1 \cap q, \ldots, p_n \cap q\}$ according to $>$. Let $p_{min}$ be the
minimum of $\{p_1 \cap q, \ldots, p_n \cap q\}$. We firstly claim that $p_{min} \neq root$. If not, then there exists $p_i \to q$ such
that $p_i \cap q = root$ and so by A4 $p \to q \in \Sigma^+$, which is a contradiction. Next we claim that $p_{min} \to q$. This
follows from the definition of $p_{min}$ and axiom A5. Define the node $p_{branch}$ by $p_{branch} = Parnt(p_{min})$.

Construct then a tree $T$ with the following properties. $T$ is complete w.r.t. $P_\Sigma$. For all paths $p'$ such
that $p' \cap p_{branch}$ is a strict prefix of $p_{branch}$, $T$ contains one path instance for $p'$. If $p' \cap p_{branch}$ is not a
strict prefix of $p_{branch}$ then $T$ contains exactly two path instances for $p'$. Moreover, if $Last(p') \notin \mathbf{E}$ then
the val of the two nodes in $N(p')$ are distinct if $p' \to q \in \Sigma^+$, otherwise they are the same. Such a tree
always exists. It is also clear from this construction that $T$ violates $p \to q$. We also note that $T$ has the
property that $p_{min}$ (and hence $p_{branch}$) cannot be a prefix of $p$. If it was then $p \to p_{min}$ by A6 and since
$p_{min} \to q$ then by A3 $p \to q \in \Sigma^+$, which is a contradiction.

We illustrate the construction by an example. Let $p \to q$ be the XFD $root.X \to root.A.B.C.D.E.E\#$
and let $\Sigma = \{root.A.B.C.C\# \to root.A.B.C.D.E.E\#,\ root.A.B.C.D.D\# \to root.A.B.C.D.E.E\#,\
root.X.X\# \to root.A\}$. Then the above construction procedure yields the tree $T$ shown in Figure 12.

Figure 12: A XML tree

We note that in the construction of $T$, the correct place to branch the tree is crucial. If the tree
branches above or below $p_{branch}$ then $T$ will not satisfy $\Sigma$.
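
The choice of the branching point can be sketched in code. The following is only an illustration under stated assumptions: paths are dot-separated strings, $p \cap q$ is taken to be the longest common prefix, and for brevity only the XFDs of the example $\Sigma$ that have $q$ on the right-hand side are used as input, rather than all such XFDs in $\Sigma^+$. The function names `common_prefix`, `parent` and `branch_point` are hypothetical.

```python
def common_prefix(p, q):
    """Longest common prefix of two dot-separated paths (assumed meaning of p ∩ q)."""
    shared = []
    for a, b in zip(p.split("."), q.split(".")):
        if a != b:
            break
        shared.append(a)
    return ".".join(shared)


def parent(path):
    """Parent path, obtained by dropping the last label (root is returned unchanged)."""
    return path.rsplit(".", 1)[0]


def branch_point(lhs_paths, q):
    """Return (p_min, p_branch) for the given left-hand sides of XFDs with r.h.s. q.

    Each intersection is a prefix of q, so the intersections are totally ordered
    by length and the minimum is simply the shortest one.
    """
    prefixes = [common_prefix(p_i, q) for p_i in lhs_paths]
    p_min = min(prefixes, key=lambda path: len(path.split(".")))
    return p_min, parent(p_min)


# Left-hand sides of the XFDs in the example above whose right-hand side is q.
q = "root.A.B.C.D.E.E#"
lhs_paths = ["root.A.B.C.C#", "root.A.B.C.D.D#"]
p_min, p_branch = branch_point(lhs_paths, q)
```

With this restricted input the sketch yields `p_min = "root.A.B.C"` and `p_branch = "root.A.B"`. Whether this coincides with the branching point drawn in Figure 12 depends on the full closure $\Sigma^+$, which is not computed here.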

It is clear from this construction that $T$ violates $p \to q$ so it remains to prove that $T$ satisfies $\Sigma$. Let
$p' \to q'$ be any XFD (not necessarily in $\Sigma$). There are several cases to consider depending on where $p'$
and $q'$ are in the tree $T$. The different cases can be best illustrated by Figure 13. In this figure we use
subscripts to denote different instances of a path. For example, $q_1$ and $q_2$ denote different path instances
of the path $q$ and $q_{v1}$ and $q_{v2}$ denote different path instances of the path $q_v$. We shall consider all possible
cases where $p' \to q'$ could be violated in $T$ and show that either $p' \to q'$ cannot be in $\Sigma$ or $T$ satisfies
$p' \to q'$.

Case AAA: $p' = p_x$, i.e. $p' \cap q = root$

Figure 13: A XML tree

Case AAA.l: qw > q' > Pmin 

Suppose that $p' \to q' \in \Sigma$. Since $q' > p_{min}$ and $Last(q') \in \mathbf{E}$ (since $q_w > q'$) it follows from A6 or A1 that
$q' \to p_{min}$. Then since $p_{min} \to q$ and applying A3 twice we derive that $p' \to q$. Then since $p' \cap q = root$
by A4 we derive that $p \to q \in \Sigma^+$, which is a contradiction, and so we conclude that $p' \to q' \notin \Sigma$.

Case AAA. 2: q' = q^, i-e. q > q' Ci q > Pmin and Last{q') ^ E 

Suppose that $p' \to q' \in \Sigma$ and $p' \to q'$ is violated in $T$. Then for this to happen the two nodes in
$N(q')$ must have different val's and so by the construction of $T$, $q' \to q \in \Sigma^+$ and so by A3 $p' \to q$.
By A4 this implies that $p \to q \in \Sigma^+$, which is a contradiction, and so we conclude that either $p' \to q'$ is
satisfied in $T$ or $p' \to q' \notin \Sigma$.

Case AAA. 3: q' = q 

If $p' \to q' \in \Sigma$, then by A4 $p \to q \in \Sigma^+$, which is a contradiction, and so we conclude that $p' \to q' \notin \Sigma$.

Case AAA.4: $q' > q$ and $Last(q') \in \mathbf{E}$

Suppose $p' \to q' \in \Sigma$. Since $q' > q$ and $q > p_{min}$, then $q' \to p_{min} \in \Sigma^+$ by A6 and since $p_{min} \to q$
by A3 this implies that $q' \to q$. So if $p' \to q'$ then by A3 $p' \to q$ so by A4 $p \to q \in \Sigma^+$, which is a
contradiction, and so $p' \to q' \notin \Sigma$.

Case AAA. 5: q' ^ q^, i.e. q' > q and Last{q') ^ E 

As for Case AAA. 2 

Case AAB: p^ > p' 
As for Case AAA. 

Case AAC: $p' = p_r$, i.e. $p \cap q > p' \cap q > root$

Case AAC.1: $q' = q$

If $p' \to q' \in \Sigma$ then it contradicts the definition of $p_{min}$ and so $p' \to q' \notin \Sigma$.

Case AAC.2: $q_w > q' > p_{min}$

Suppose that $p' \to q' \in \Sigma$. Since $q' > p_{min}$ and $Last(q') \in \mathbf{E}$ it follows from A6 that $q' \to p_{min}$. Then since
$p_{min} \to q$ and applying A3 twice we derive that $p' \to q \in \Sigma^+$, which contradicts the definition of $p_{min}$,
and so $p' \to q' \notin \Sigma$.

Case AAC.3; q' = q^j, i.e. q> q' C\q> Pmin and Last{q') ^ E 

Suppose that $p' \to q' \in \Sigma$ and $p' \to q'$ is violated in $T$. Then for this to happen the two nodes in
$N(q')$ must have different val's and so by the construction of $T$, $q' \to q \in \Sigma^+$ and so by A3 $p' \to q$. By
A4 this implies that $p \to q \in \Sigma^+$, which contradicts the definition of $p_{min}$, and so we conclude that either
$p' \to q'$ is satisfied in $T$ or $p' \to q' \notin \Sigma$.

Case AAC.4: q' > q and Last{q') G E 

Suppose that $p' \to q' \in \Sigma$. Since $q' > q$ and $q > p_{min}$, then $q' \to p_{min} \in \Sigma^+$ and since $p_{min} \to q$ it
follows by A3 that $p' \to q$, which contradicts the definition of $p_{min}$, and so $p' \to q' \notin \Sigma$.
Case AAC.5: q' = q^, i.e. q' > q and Last{q') ^ E 

If $p' \to q'$ is violated in $T$, then it follows by construction of $T$ that $q' \to q$ and so by A3 $p' \to q$, which
contradicts the definition of $p_{min}$, and so $p' \to q'$ is satisfied.

Case AAD: pr > p' 
Case AAD.I: q' = q 

Assume that $p' \to q' \in \Sigma$. If $p'$ is a prefix of $p$ then by A6 $p \to p'$ and so since $p' \to q$ it follows by A3
that $p \to q \in \Sigma^+$, which is a contradiction, and so $p' \to q' \notin \Sigma$. If $p'$ is not a prefix of $p$ then it follows
from the fact that $p' \to q$ and A5 that $p' \cap p \to q$ and since $p \to p' \cap p$ by A6 then applying A3 derives
the contradiction that $p \to q \in \Sigma^+$ and so $p' \to q' \notin \Sigma$.

Case AAD. 2: qw > q' > Pmin 

Assume that $p' \to q' \in \Sigma$. Then $p' \cap q \to q$ by A4. However by definition of $p'$, $p' \cap q$ is a prefix of $p$
and so by A6 $p \to p' \cap q$ and so by A3 $p \to q$, which is a contradiction, and so $p' \to q' \notin \Sigma$.
Case AAD. 3: q' = qw, i-e. q > q' riq> Pmin and Last{q') ^ E 

Suppose that $p' \to q' \in \Sigma$ and $p' \to q'$ is violated in $T$. Then for this to happen the two nodes in
$N(q')$ must have different val's and so by the construction of $T$, $q' \to q \in \Sigma^+$ and so by A3 $p' \to q$.
However using the same argument as in AAD.2, if $p' \to q$ then $p \to q \in \Sigma^+$, which is a contradiction, and
so we conclude that either $p' \to q'$ is satisfied in $T$ or $p' \to q' \notin \Sigma$.

Case AAD.4: q' > q and Last{q') E E 

Assume that $p' \to q' \in \Sigma$. As in AAC.4 we derive that $p' \to q$ and so using the same argument as in
AAD.2 we derive the contradiction that $p \to q \in \Sigma^+$ and so $p' \to q' \notin \Sigma$.
Case AAD. 5: q' = q„, i.e. q' > q and Last{q') ^ E 

As in AAC.5 we derive that $p' \to q$ and so using the same argument as in AAD.2 we derive the
contradiction that $p \to q \in \Sigma^+$.

Case AAE: $p' = p_s$, i.e. $p_{min} > p' \cap q > p \cap q$

As for Case AAC.

Case AAF: ps > p' 

We can assume that $p \cap q$ is a prefix of $p'$ or else the case reduces to Case AAD.

Case AAF.1: $q' = q$

Assume that $p' \to q' \in \Sigma$. Since $p' \to q$, by A5 we have that $p' \cap q \to q$ and since $p_s > p'$ it follows
that $p' \cap q$ is a strict prefix of $p_{min}$, which contradicts the definition of $p_{min}$, and so $p' \to q' \notin \Sigma$.

Case AAF.2: $q_w > q' > p_{min}$

Suppose that $p' \to q' \in \Sigma$. Since $p_{min} \to q$ it follows from A5 and the definition of $q'$ that $q' \cap q \to q$.
However since $q' > p_{min}$ then $Last(q') \in \mathbf{E}$ and so since $q' \cap q$ is a prefix of $q'$ it follows that $q' \to q' \cap q$.
So applying A3 twice we derive that $p' \to q$, which contradicts the definition of $p_{min}$,
and so $p' \to q' \notin \Sigma$.

Case AAF. 3: q' = q^, i.e. q > q' Dq > Pmin and Last{q') ^ E 

Suppose that p' —> q' €z T, and p' q' is violated in T. Then for this to happen the two nodes in 
N{q') must have different val's and so by the construction of T, g' — ^ q G S+ and so by A3 p' — » g G S"*" 
which again contradicts the definition oi pmin- So we conclude that either p' q' is satisfied in T or 
p'^q'i^. 

Case AAF. 4: q' > q and Last{q') G E 

Suppose that p' ^ q' G S. Since q' > q and q > Pmin, then q' Prnin G by A6. Then since 
Pmin ^ q, it follows by applying A3 three times that p' —>^ q ^ S+ which contradicts the definition of 
Pmin and so p' ^ g' ^ S. 

Case AAF. 5: q' = q^, i.e. q' > q and Last{q') ^ E 

Suppose that p' ^ g' G S and p' q' is violated in T. If p' ^ g' is violated in T then by definition 
of T we must have that q' q and so by A3 p' ^ g which contradicts the definition of Pmin- So we 
conclude that either p' q' is satisfied in T or p' ^ g' ^ E. 

Case AAG: p' = p_t, i.e. p > p' ∩ q > p ∩ q
Case AAG.1: q' = q

If p' → q ∈ Σ then by A5 p' ∩ q → q ∈ Σ⁺ and by A6 p → p' ∩ q and so by A3 p → q ∈ Σ⁺ which is a contradiction, and so p' → q ∉ Σ.

Case AAG.2: q_w > q' > p_min

If p' → q' ∈ Σ, then by A6 q' → p_min and since p_min → q it follows by applying A3 twice that p' → q. Then by A5 p' ∩ q → q and since p → p' ∩ q by A6 we derive the contradiction p → q ∈ Σ⁺ and so p' → q' ∉ Σ.

Case AAG.3: q' = q_w, i.e. q > q' ∩ q > p_min and Last(q') ∉ E

Suppose that p' → q' ∈ Σ and p' → q' is violated in T. Then for this to happen the two nodes in N(q') must have different val's and so by the construction of T, q' → q ∈ Σ⁺ and so by A3 p' → q ∈ Σ⁺ which leads to a contradiction as in Case AAD. So we conclude that either p' → q' is satisfied in T or p' → q' ∉ Σ.

Case AAG.4: q' > q and Last(q') ∈ E

Using the same reasoning as in Case AAG.2, if p' → q' ∈ Σ, then we derive the contradiction p → q ∈ Σ⁺ and so p' → q' ∉ Σ.

Case AAG.5: q' = q_v, i.e. q' > q and Last(q') ∉ E

Suppose that p' → q' ∈ Σ and p' → q' is violated in T. Then by the construction of T, for this to happen we must have that q' → q and so by A3 p' → q. Then using the same reasoning as in AAG.2 we derive the contradiction p → q ∈ Σ⁺. So we conclude that either p' → q' is satisfied in T or p' → q' ∉ Σ.

Case AAH: p_t > p' > p_min
As for case AAG.

Case AAI: p' = p_u, i.e. p' > p and Last(p') ∉ E
Case AAI.1: q' = q

If p' → q ∈ Σ then since p is a prefix of p' it follows by A5 that p → q ∈ Σ⁺ which is a contradiction and so p' → q ∉ Σ.

Case AAI.2: q_w > q' > p_min
Same as Case AAG.2.

Case AAI.3: q' = q_w, i.e. q > q' ∩ q > p_min and Last(q') ∉ E

As for Case AAG.3.

Case AAI.4: q' > q and Last(q') ∈ E

As for AAG.2.

Case AAI.5: q' = q_v, i.e. q' > q and Last(q') ∉ E
As for case AAG.5.

Case AAJ: p_u > p' > p.
As for Case AAI.

Case AAK: p' = q_w, i.e. q > p' ∩ q > p_min and Last(p') ∉ E
Case AAK.1: q' = q

By the construction of T, if p' → q' ∈ Σ then the two nodes in N(p') have distinct val and so p' → q' is satisfied in T.

Case AAK.2: q_w > q' > p_min

By A6 it follows that q' → p_min and since p_min → q it follows by A3 that q' → q. So if p' → q' then by A3 we have that p' → q. Hence by the construction of T the two nodes in N(p') must have different val's and so by the definition of an XFD p' → q' must be satisfied in T.

Case AAK.3: q' > q and Last(q') ∈ E

As for case AAG.2 it follows that q' → q, and so from A3 p' → q which, using the same reasoning as in case AAK.2, implies that p' → q' is satisfied in T.

Case AAK.4: q' = q_v, i.e. q' > q and Last(q') ∉ E

Suppose that p' → q' ∈ Σ and p' → q' is violated in T. For this to happen we must have the two nodes in N(p') have the same val and the two nodes in N(q') have different val's. However, by the construction of T if the two nodes in N(q') have different val's then q' → q ∈ Σ⁺. So applying A3 we derive that p' → q ∈ Σ⁺ and by the definition of T this implies that the two nodes in N(p') must have different val's which is a contradiction. So p' → q' ∉ Σ or p' → q' is satisfied in T.

Case AAL: p' = q_v, i.e. p' > q and Last(p') ∉ E

Case AAL.1: q' = q

As for case AAK.1.

Case AAL.2: q_w > q' > p_min

As for case AAK.2.

Case AAL.3: q' > q and Last(q') ∈ E
As for case AAK.3.

Case AB: q > p

We firstly note that because of axiom A7 we can rule out the case where q ∈ Att(p). Let {p_1 → q, ..., p_n → q} be the set of all XFDs in Σ⁺ which have q on the r.h.s. (we note that Σ⁺ can be computed using Algorithm 2 in Section 6). Consider the paths {p_1 ∩ q, ..., p_n ∩ q}. Since each of these paths is a prefix of q we can order the set {p_1 ∩ q, ..., p_n ∩ q} according to >. Let p_min1 be the minimum of {p_1 ∩ q, ..., p_n ∩ q} such that p_min1 > p. We note that p_min1 and p are comparable since both are prefixes of q. Define the node p_branch1 by p_branch1 = Parnt(p_min1). We also note that since p_i → q, it follows from axiom A5 that p_min1 → q.
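
The selection of p_min1 and p_branch1 can be illustrated with a small sketch. It is not part of the original proof and rests on assumptions: paths are modelled as tuples of labels, p ∩ q is taken to be the longest common prefix, "a > b" is read as "b is a strict prefix of a", and the function names (intersect, is_strict_prefix, parent, p_min1) are hypothetical helpers introduced only for this illustration.

```python
from typing import Iterable, Optional, Tuple

# Assumption: a path is a tuple of labels starting at the root.
Path = Tuple[str, ...]


def intersect(p: Path, q: Path) -> Path:
    """Longest common prefix of two paths (our reading of p ∩ q)."""
    common = []
    for a, b in zip(p, q):
        if a != b:
            break
        common.append(a)
    return tuple(common)


def is_strict_prefix(p: Path, q: Path) -> bool:
    """True if p is a strict prefix of q (our reading of q > p)."""
    return len(p) < len(q) and q[:len(p)] == p


def parent(p: Path) -> Path:
    """Path with its last label dropped (our reading of Parnt)."""
    return p[:-1]


def p_min1(lhs_paths: Iterable[Path], q: Path, p: Path) -> Optional[Path]:
    """Shortest p_i ∩ q that strictly extends p, or None if there is none.

    lhs_paths stands for the left-hand sides p_1, ..., p_n of the XFDs
    in the closure whose right-hand side is q.
    """
    candidates = [intersect(p_i, q) for p_i in lhs_paths]
    above_p = [c for c in candidates if is_strict_prefix(p, c)]
    # All candidates are prefixes of q, so ordering by length matches
    # the prefix ordering among them.
    return min(above_p, key=len) if above_p else None
```

For example, with q = ("root", "a", "b", "c"), p = ("root", "a") and left-hand sides ("root", "a", "b", "x") and ("root", "a", "b", "c", "d"), the sketch selects ("root", "a", "b") as p_min1, and parent of that path is the branching point.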

Construct then a tree T with the following properties. T is complete w.r.t. P_Σ. For all paths p' such that p' ∩ p_branch1 is a strict prefix of p_branch1, T contains one path instance for p'. If p' ∩ p_branch1 is not a strict prefix of p_branch1 then T contains two path instances for p'. Moreover, if Last(p') ∉ E then the val of the two nodes in N(p') are distinct if p' → q ∈ Σ⁺, otherwise they are the same. Such a tree always exists. It is also clear from this construction that T violates p → q. As before, we note that the decision of where to branch the tree is critical in the construction of a tree which satisfies Σ. We claim that T satisfies Σ. As before we let p' → q' be any XFD (not necessarily in Σ). The various cases that can arise are illustrated in Figure 14. In this figure, as previously, we use subscripts to denote different instances of a path. We shall consider all possible cases where p' → q' could be violated in T and show that either p' → q' cannot be in Σ or T satisfies p' → q'.
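
The branching rule of this construction can be sketched in a few lines. This is only an illustration under assumptions, not the authors' construction: paths are tuples of labels, the intersection of two paths is their longest common prefix, and the names instance_count and vals_distinct, as well as the parameters ends_in_element and lhs_determining_q, are hypothetical.

```python
from typing import Optional, Set, Tuple

# Assumption: a path is a tuple of labels starting at the root.
Path = Tuple[str, ...]


def intersect(p: Path, q: Path) -> Path:
    """Longest common prefix of two paths (our reading of p ∩ q)."""
    common = []
    for a, b in zip(p, q):
        if a != b:
            break
        common.append(a)
    return tuple(common)


def is_strict_prefix(p: Path, q: Path) -> bool:
    """True if p is a strict prefix of q."""
    return len(p) < len(q) and q[:len(p)] == p


def instance_count(p_prime: Path, p_branch: Path) -> int:
    """One instance above the branching point, two at or below it."""
    meet = intersect(p_prime, p_branch)
    return 1 if is_strict_prefix(meet, p_branch) else 2


def vals_distinct(
    p_prime: Path,
    p_branch: Path,
    ends_in_element: bool,
    lhs_determining_q: Set[Path],
) -> Optional[bool]:
    """Whether the two nodes reached by p_prime get distinct values.

    Returns None when the rule does not apply, i.e. when p_prime has a
    single instance or ends in an element. lhs_determining_q stands for
    the set of paths p' for which p' → q is in the closure.
    """
    if ends_in_element or instance_count(p_prime, p_branch) != 2:
        return None
    return p_prime in lhs_determining_q
```

The sketch makes the design point of the construction visible: duplicated non-element nodes differ in value exactly when their path determines q, which is what lets the later cases derive a contradiction from a violated XFD.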



Figure 14: An XML tree

Case ABA: p' = p_x, i.e. p' ∩ q = root
Case ABA.1: q_w > q' > p_min1

Suppose that p' → q' ∈ Σ. Then by A6 q' → p_min1 and p_min1 → q and so by A3 p' → q ∈ Σ⁺. Then by A4 p → q ∈ Σ⁺ which is a contradiction and so p' → q' ∉ Σ.

Case ABA.2: q' = q_w, i.e. q > q' ∩ q > p_min and Last(q') ∉ E
As for case AAA.2.

Case ABA.3: q' = q
As for case AAA.3.

Case ABA.4: q' > q and Last(q') ∈ E

Suppose that p' → q' ∈ Σ. Since q' > q and q > p_min1, then q' → p_min1 ∈ Σ⁺ by A6 and since by definition p_min1 → q by A3 this implies that q' → q. As for case ABA.1 this implies p → q ∈ Σ⁺ which is a contradiction and so p' → q' ∉ Σ.

Case ABA.5: q' = q_v, i.e. q' > q and Last(q') ∉ E

As for Case AAA.2.

Case ABB: p_x > p'
As for Case AAA.

Case ABC: p' = p_r, i.e. p ∩ q > p' ∩ q > root
Case ABC.1: q' = q

If p' → q ∈ Σ then by A5 p → q ∈ Σ⁺ which is a contradiction and so p' → q ∉ Σ.

Case ABC.2: q_w > q' > p_min1

Suppose that p' → q' ∈ Σ. Since q' > p_min1 and since Last(q') ∈ E it follows from A6 that q' → p_min1. Also since p_min1 → q by applying A3 twice we derive that p' → q and so by A5 p → q ∈ Σ⁺ which is a contradiction and so p' → q' ∉ Σ.

Case ABC.3: q' = q_w, i.e. q > q' ∩ q > p_min and Last(q') ∉ E

Suppose that p' → q' ∈ Σ and p' → q' is violated in T. Then for this to happen the two nodes in N(q') must have different val's and so by the construction of T, q' → q ∈ Σ⁺ and so by A3 p' → q ∈ Σ⁺ which, as for case ABC.2, is a contradiction. So we conclude that either p' → q' is satisfied in T or p' → q' ∉ Σ.

Case ABC.4: q' > q and Last(q') ∈ E

Suppose that p' → q' ∈ Σ and p' → q' is violated in T. Since q' > q and q > p_min1 then from A6 q' → p_min1 and since p_min1 → q it follows by A3 that p' → q which, as in case ABC.2, is a contradiction. So p' → q' is satisfied in T or p' → q' ∉ Σ.

Case ABC.5: q' = q_v, i.e. q' > q and Last(q') ∉ E

As for case ABC.3.

Case ABD: p' = p_t, i.e. p > p' ∩ q > p ∩ q
Case ABD.1: q' = q

p' → q' cannot be in Σ or else it contradicts the definition of p_min1.

Case ABD.2: q_w > q' > p_min1

If p' → q' ∈ Σ, then by the same reasoning as in AAG.2 p' → q which contradicts the definition of p_min1.

Case ABD.3: q' = q_w, i.e. q > q' ∩ q > p_min and Last(q') ∉ E

Suppose that p' → q' ∈ Σ and p' → q' is violated in T. Then for this to happen the two nodes in N(q') must have different val's and so by the construction of T, q' → q ∈ Σ⁺ and so by A3 p' → q. By A4 this implies that p → q ∈ Σ⁺ which contradicts the definition of p_min1 and so we conclude that either p' → q' is satisfied in T or p' → q' ∉ Σ.

Case ABD.4: q' > q and Last(q') ∈ E

Suppose that p' → q' ∈ Σ. Since q' > q and q > p_min1 then by A6 q' → p_min1 and since p_min1 → q by A3 it follows that p' → q ∈ Σ⁺. However this contradicts the definition of p_min1 and so p' → q' ∉ Σ.

Case ABD.5: q' = q_v, i.e. q' > q and Last(q') ∉ E
As for ABC.3.

Case ABE: p_t > p'
As for case ABD.

Case ABF: p' = q_w, i.e. q > p' ∩ q > p_min and Last(p') ∉ E
Case ABF.1: q' = q
As for case AAK.1.

Case ABF.2: q_w > q' > p_min1

By A6 it follows that q' → p_min1 and since p_min1 → q it follows by A3 that q' → q. Then following AAK.2, p' → q' must be satisfied in T.

Case ABF.3: q' > q and Last(q') ∈ E

As for case AAG.2 it follows that q' → q, and so from A3 p' → q. So by construction of T the two nodes in N(p') have different val's and so p' → q' is satisfied.

Case ABF.4: q' = q_v, i.e. q' > q and Last(q') ∉ E
As for AAK.4.

Case ABG: p' = q_v, i.e. p' > q ∩ p' and Last(p') ∉ E

Case ABG.1: q' = q

As for case AAK.1.

Case ABG.2: q_w > q' > p_min1

As for case AAK.2.

Case ABG.3: q' > q and Last(q') ∈ E
As for case AAK.3.

Case ABH: p' = p

Case ABH.1: q_w > q' > p_min1

Suppose that p' → q' ∈ Σ. By A6 it follows that q' → p_min1 and since p_min1 → q it follows by A3 that q' → q. Then since p' → q' applying A3 means that p → q ∈ Σ⁺ which is a contradiction and so p' → q' ∉ Σ.

Case ABH.2: q' = q_w, i.e. q > q' ∩ q > p_min and Last(q') ∉ E

Suppose that p' → q' ∈ Σ and p' → q' is violated in T. Then for this to happen the two nodes in N(q') must have different val's and so by the construction of T, q' → q ∈ Σ⁺ and so by A3 p → q which is a contradiction. So we conclude that either p' → q' is satisfied in T or p' → q' ∉ Σ.

Case ABH.3: q' > q and Last(q') ∈ E

As for Case ABH.1.

Case ABH.4: q' = q_v, i.e. q' > q and Last(q') ∉ E
As for Case ABH.2.

Case B: Last(p) ∉ E

There are only two cases to consider: (a) p > q; (b) p ≯ q and q ≯ p. We consider (b) first.

Case BA: p ≯ q and q ≯ p

Let p_min and p_branch be defined as in Case AA. We now consider the two subcases where p > p_branch and p ≯ p_branch. We consider the first subcase first.

Case BAA: p > p_branch

Construct then a tree T with the following properties. Firstly T is complete w.r.t. P_Σ. For all paths p' such that p' ∩ p_branch is a strict prefix of p_branch, T contains one path instance for p'. If p' ∩ p_branch is not a strict prefix of p_branch then T contains two path instances for p' with the following properties. If p' = p then the val of the two nodes in N(p) is the same, otherwise if Last(p') ∉ E then the val of the two nodes in N(p') are distinct if p' → q ∈ Σ⁺, otherwise they are the same. Such a tree always exists.

It is clear from this construction that T violates p → q so it remains to prove that T satisfies Σ. We let p' → q' be any XFD (not necessarily in Σ). We shall consider all possible cases where p' → q' may be violated, and show that either p' → q' is satisfied in T or p' → q' cannot be in Σ. There are several cases to consider depending on where p' and q' are in the tree. The different cases can be best illustrated by Figure 15.




Figure 15: An XML tree

Case BAAA: p' = p_x, i.e. p' ∩ q = root
Case BAAA.1: q_w > q' > p_min
As for Case AAA.1.

Case BAAA.2: q' = q_w, i.e. q > q' ∩ q > p_min and Last(q') ∉ E

As for Case AAA.2.

Case BAAA.3: q' = q

As for Case AAA.3.

Case BAAA.4: q' > q and Last(q') ∈ E

As for Case AAA.4.

Case BAAA.5: q' = q_v, i.e. q' > q and Last(q') ∉ E
As for Case AAA.2.

Case BAAA.6: q' = q_s, i.e. p ∩ q > p ∩ q' > p_min

Suppose that p' → q' ∈ Σ and p' → q' is violated in T. Then, by construction of T, we have that q_s → q ∈ Σ⁺ and so by A3 p' → q and so by A4 p → q ∈ Σ⁺ which is a contradiction. So p' → q' is satisfied or p' → q' ∉ Σ.

Case BAAA.7: q_s > q' > p_min

Suppose that p' → q' ∈ Σ. If q_s > q' and q' > p_min then q' → p_min by A6 and since p_min → q then, by A3, p' → q which implies a contradiction as in the previous case and so p' → q' ∉ Σ.

Case BAAA.8: q' = q_y, i.e. q > q' > p ∩ q
As for BAAA.6.

Case BAAA.9: q_y > q' > p ∩ q
As for AAA.1.

Case BAAA.10: q' = p_u, i.e. q' > p and Last(q') ∉ E

If T violates p' → q', then by construction of T we have that q' → q and so by A3 p' → q which implies a contradiction as in case BAAA.6 and so p' → q' is satisfied.

Case BAAA.11: p_u > q' > p

Suppose that p' → q' ∈ Σ. By A6 q' → p_min and, since p_min → q, by A3 p' → q which implies a contradiction as in BAAA.6 and so p' → q' ∉ Σ.

Case BAAB: p_x > p'
As for Case BAAA.

Case BAAC: p' = p_r, i.e. p ∩ q > p' ∩ q > root
Case BAAC.1: q_w > q' > p_min
As for case AAC.1.

Case BAAC.2: q' = q_w, i.e. q > q' ∩ q > p_min and Last(q') ∉ E
As for case AAC.3.

Case BAAC.3: q' = q
As for case AAC.1.

Case BAAC.4: q' > q and Last(q') ∈ E
As for Case AAC.4.

Case BAAC.5: q' = q_v
As for Case AAC.5.

Case BAAC.6: q' = q_s, i.e. p ∩ q > p ∩ q' > p_min and Last(q') ∉ E

Suppose p' → q' ∈ Σ. As in case BAAA.6 we can derive that p' → q ∈ Σ⁺ which contradicts the definition of p_min and so p' → q' ∉ Σ.

Case BAAC.7: q_s > q' > p_min

Suppose p' → q' ∈ Σ. As in Case BAAA.7 we can derive that p' → q which contradicts the definition of p_min and so p' → q' ∉ Σ.

Case BAAC.8: q' = q_y, i.e. q > q' > p ∩ q

As for BAAC.6.

Case BAAC.9: q_y > q' > p ∩ q

As for BAAC.7.

Case BAAC.10: q' = p_u, i.e. q' > p and Last(q') ∉ E

Suppose p' → q' ∈ Σ. As in BAAA.10 we can derive that p' → q which contradicts the definition of p_min and so p' → q' ∉ Σ.

Case BAAC.11: p_u > q' > p

Suppose p' → q' ∈ Σ. As in BAAA.11 we can derive that p' → q which contradicts the definition of p_min and so p' → q' ∉ Σ.

Case BAAD: p_r > p'
Case BAAD.1: q' = q

As for Case AAD.1. The other cases are similar to the corresponding BAAC cases.
Case BAAE: p' = q_s, i.e. p ∩ q > p ∩ q' > p_min and Last(q') ∉ E

Case BAAE.1: q_w > q' > p_min

Suppose that p' → q' ∈ Σ and p' → q' is violated. Then since q' > p_min and Last(q') ∈ E then by A6 q' → p_min and p_min → q it follows by applying A3 twice that p' → q ∈ Σ⁺. However, by the construction of T if p' → q ∈ Σ⁺ then the two nodes in N(p') must have different val's. However, for p' → q' to be violated the two nodes in N(p') must have the same val's which is a contradiction. So we conclude that either p' → q' is satisfied or p' → q' ∉ Σ.

Case BAAE.2: q' = q_w, i.e. q > q' ∩ q > p_min and Last(q') ∉ E

Suppose that p' → q' ∈ Σ and that p' → q' is violated in T. Then for this to happen the two nodes in N(p') must have the same val and the two nodes in N(q') must have different val's. So by construction of T, q' → q ∈ Σ⁺ and hence by A3 this implies that p' → q ∈ Σ⁺. However, by the construction of T, this implies that the two nodes in N(p') must have different val's which is a contradiction and so p' → q' ∉ Σ or p' → q' is satisfied in T.

Case BAAE.3: q' = q

If p' → q' ∈ Σ then by the construction of T the two nodes in N(p') have different val's and so p' → q' is satisfied.

Case BAAE.4: q' > q and Last(q') ∈ E

Suppose that p' → q' ∈ Σ. Then by A6 q' → p_min and since p_min → q by A3 we have that p' → q ∈ Σ⁺. However, by the construction of T, this implies that the two nodes in N(p') have different val's and so p' → q' is satisfied.

Case BAAE.5: q' = q_v, i.e. q' > q and Last(q') ∉ E

Suppose that p' → q' ∈ Σ. As for case AAC.5 we derive that p' → q ∈ Σ⁺ and so, as for case BAAE.4, this implies p' → q' is satisfied.

Case BAAE.6: q' = q_y, i.e. q > q' > p ∩ q

Suppose that p' → q' is violated in T. For this to happen the two nodes in N(q') must have different val's and so by the definition of T q' → q ∈ Σ⁺. Then by A3 p' → q ∈ Σ⁺ and so by definition of T the two nodes in N(p') must have different val's which contradicts the fact that p' → q' is violated and so p' → q' is satisfied.

Case BAAE.7: q_y > q' > p ∩ q

As for case BAAE.1.

Case BAAE.8: q_s > q' > p_min

As for BAAA.7 we derive that p' → q. So by the definition of T the two nodes in N(p') must have different val's and so p' → q' must be satisfied.

Case BAAE.9: q' = p_u, i.e. q' > p and Last(q') ∉ E

As for case BAAE.2.

Case BAAE.10: p_u > q' > p

As for case BAAE.1.

Case BAAF: p' = q_w, i.e. q > p' ∩ q > p_min and Last(p') ∉ E
Case BAAF.1: q_w > q' > p_min
As for case BAAE.1.

Case BAAF.2: q' = q_s, i.e. p ∩ q > p ∩ q' > p_min and Last(q') ∉ E

As for case BAAE.2.

Case BAAF.3: q' = q

As for case BAAE.3.

Case BAAF.4: q' > q and Last(q') ∈ E

As for case BAAE.4.

Case BAAF.5: q' = q_v, i.e. q' > q and Last(q') ∉ E
As for case BAAE.5.

Case BAAF.6: q' = q_y, i.e. q > q' > p ∩ q

As for BAAE.6.

Case BAAF.7: q_y > q' > p ∩ q

As for case BAAE.7.

Case BAAF.8: q_s > q' > p_min

As for BAAE.8.

Case BAAF.9: q' = p_u, i.e. q' > p and Last(q') ∉ E

As for case BAAE.2.

Case BAAF.10: p_u > q' > p

As for case BAAA.1.

Case BAAG: p' = q_y, i.e. q > q' > p ∩ q
As for case BAAF.

Case BAAH: p' = p_u, i.e. p' > p and Last(p') ∉ E
Case BAAH.1: q_w > q' > p_min
As for case BAAE.1.

Case BAAH.2: q' = q_s, i.e. p ∩ q > p ∩ q' > p_min and Last(q') ∉ E

As for case BAAF.2.

Case BAAH.3: q' = q

As for case BAAF.3.

Case BAAH.4: q' > q and Last(q') ∈ E

As for case BAAF.4.

Case BAAH.5: q' = q_v, i.e. q' > q and Last(q') ∉ E
As for case BAAF.5.

Case BAAH.6: q' = q_y, i.e. q > q' > p ∩ q

As for BAAF.6.

Case BAAH.7: q_y > q' > p ∩ q

As for case BAAF.7.

Case BAAH.8: q_s > q' > p_min

As for BAAF.8.

Case BAAH.9: q' = q_w, i.e. q > q' ∩ q > p_min and Last(q') ∉ E

As for case BAAE.6.

Case BAAH.10: p_u > q' > p

As for case BAAE.1.

Case BAAI: p' = q_v, i.e. p' > q and Last(p') ∉ E
As for case BAAH.

Case BAAJ: p' = p

Case BAAJ.1: q_w > q' > p_min

Suppose p' → q' ∈ Σ. Following case BAAE.1 we derive that p → q ∈ Σ⁺ which is a contradiction, and so p' → q' ∉ Σ.

Case BAAJ.2: q' = q_w, i.e. q > q' ∩ q > p_min and Last(q') ∉ E

Suppose that p' → q' ∈ Σ and p' → q' is violated in T. Then for this to happen the two nodes in N(q') must have different val's and so by the construction of T, q' → q ∈ Σ⁺ and so by A3 p' → q. By A4 this implies that p → q ∈ Σ⁺ which is a contradiction and so we conclude that either p' → q' is satisfied in T or p' → q' ∉ Σ.

Case BAAJ.3: q' > q and Last(q') ∈ E

Assume that p' → q' ∈ Σ. Then as in case BAAE.4 we derive the contradiction that p → q ∈ Σ⁺ and so p' → q' ∉ Σ.

Case BAAJ.4: q' = q_v, i.e. q' > q and Last(q') ∉ E

As for case BAAJ.2.

Case BAAJ.5: q' > q and Last(q') ∈ E

Assume that p' → q' ∈ Σ. Then as in case BAAE.4 we derive the contradiction that p → q ∈ Σ⁺ and so p' → q' ∉ Σ.

Case BAAJ.6: q' = q_y, i.e. q > q' > p ∩ q

As for BAAJ.2.

Case BAAJ.7: q_y > q' > p ∩ q

Assume that p → q' ∈ Σ. Then as in case BAAE.1 we derive the contradiction that p → q ∈ Σ⁺ and so p' → q' ∉ Σ.

Case BAAJ.8: q' = q_s, i.e. p ∩ q > p ∩ q' > p_min and Last(q') ∉ E
As for case BAAJ.2.

Case BAAJ.9: q_s > q'

Assume that p → q' ∈ Σ. Then as in case BAAC.7 we derive the contradiction that p → q ∈ Σ⁺ and so p' → q' ∉ Σ.

Case BAAJ.10: q' = p_u, i.e. q' > p and Last(q') ∉ E

As for case BAAJ.2.

Case BAAJ.11: p_u > q' > p

Assume that p → q' ∈ Σ. Then as in case BAAE.1 we derive the contradiction that p → q ∈ Σ⁺ and so p' → q' ∉ Σ.

Case BAB: p ≯ p_branch

Construct a tree T as in Case AB. To show that T satisfies Σ we let p' → q' be any XFD in Σ. We then consider all possible cases where p' → q' may be violated, and show that either p' → q' is satisfied in T or p' → q' cannot be in Σ. The different cases are illustrated in Figure 16. Then the same arguments as in Case AB show that T satisfies Σ.
Figure 16: An XML tree

Case BB: p > q

The first thing we note is that q ≠ root because of A8. Construct then a tree T as in case BAA. To show that T satisfies Σ we let p' → q' be any XFD in Σ. We then consider all possible cases where p' → q' may be violated, and show that either p' → q' is satisfied in T or p' → q' cannot be in Σ. The different cases are illustrated in Figure 17. Then the same arguments as in Case BAA show that T satisfies Σ.



Figure 17: An XML tree

□ 